February 9, 2016
Open Access

Under‐Appreciated Steps in Instrument Development, Part I: Starting With Validity

Abstract

Research in Nursing & Health (RINAH) receives quite a few instrument development reports that are rejected without external review. A stellar factor analysis and internal consistency estimation may have been done, but this is not sufficient for publication here, because the validity testing process is incomplete. My purpose in this editorial, the first of a two-part series, is a brief review of some key steps at the beginning of the instrument development process, without which the validity of the measurement remains in question. The second editorial will address key validity-focused steps taken after the instrument has been drafted. None of these ideas is original; Froman and Schmitt wrote a similar editorial in this journal in 2003, Cook and Beckman reviewed them for a physician audience (2006), and Higgins and Straub (2006) made these points to a nursing audience in the recent past. I hope that another reminder of some steps many of us learned in graduate school will increase the number of publishable papers we receive.

The most common item development strategies depicted in papers we receive are assembling a collection of items from existing instruments, and/or drafting new items based on existing literature, followed by eliciting “face validity” (subjective impression of transparency and relevance) responses and numerical content validity ratings of the items from content experts. For example, if I am developing a tool to measure discouragement in those with chronic illness, I might ask psychologists to identify the emotions associated with discouragement and generate items asking about the severity of these emotional symptoms. I might review measures of hopelessness or other related concepts and borrow items that appear related to discouragement. This is a common set of strategies, but more can be done, and I argue that more should be done to carefully conceptualize what the instrument will measure.
Both asking experts and trolling for already-developed items have utility, but they carry a risk of re-creating the weaknesses and blind spots of past views of the concept. The approaches below can reduce this risk.

By top-down strategies I mean using well-developed existing theoretical structures to determine the content of an instrument. This is sometimes called a deductive approach, logical partitioning, or classification from above (Hinkin, 1995). When the components of a concept have been defined in published theory or concept analysis, the researcher can use those sources to develop a list of the conceptual domains that should be assessed and estimate the proportion of items that should measure each domain. Some content areas might warrant more items and some fewer, depending on their centrality or complexity. In the education field, it is common to create a test blueprint (also called a test grid or table of specifications) to map out the instrument's conceptual domains. A test blueprint helps ensure that all important aspects of the concept are included and provides a quick reference to the content of the tool (Sax, 1997). Researchers will want to develop multiple items for each component of a concept, because items that do not function well in later testing may need to be deleted. A test blueprint enables the researcher to check after item deletion to be sure the remaining items are appropriately distributed across domains.

For my discouragement example, I will need to find theoretical work already done on this concept or do it myself. If no rigorous concept analysis of discouragement has been published, I can conduct one, following a systematic process. Generally these processes involve searching for and analyzing all uses and demonstrations of the concept in existing literature, and seeking additional evidence in newly generated data, perhaps interviews of clinicians or patients.
After this work is done, I might find that discouragement in chronic illness is (hypothetically) composed of negative emotions including sadness and apathy, and of perceptions of one's situation including personal inability to change it and low likelihood of change by other influences. Both emotions and perceptions hypothetically are influenced by perceived duration and severity of the chronic illness, number and severity of concurrent or past challenges in health or other important life experiences, and degree of perceived success at managing or overcoming the challenges. With this systematically developed road map of domains of discouragement as the basis for a test blueprint, I can develop clusters of items to measure each of these domains.

By bottom-up strategies I mean inductive, from-the-ground-up approaches to identify the scope and characteristics of the concept. Open-ended, minimally structured qualitative interviews and focus groups with members of the target population can generate exemplars of the experience of the concept in their lives. Caregivers, both professional and informal, may provide helpful ancillary views, but if the instrument will be completed by members of the target population, that population's own input is important. Themes developed in qualitative analysis that characterize salient aspects of this experience can be used to map the domains of the test blueprint. Verbatim data can supply vocabulary that participants will recognize as depicting the experience.

For the discouragement example, I may ask clinician colleagues to guide me to recruit a mix of patients who (in the clinicians’ view) range from very discouraged to not at all discouraged.
I can interview these participants individually or hold focus groups to elicit what discouragement feels like, how they view the world and their future when discouraged or not, whether it is intermittent or constant, the events or experiences that cause it to increase or decrease, what their close companions tell them, and so forth. In doing so, I may discover an aspect of discouragement not previously described, perhaps the perceived reactions of others. When participants perceive that significant others react to the ill person's discouragement with anger, disappointment, loss of warmth in interactions, decline in support, and the like, discouragement may worsen; whereas others’ optimism, warmth despite the ill person's negativity, continued positive feedback, and help to achieve goals may lessen it. Reactions of others could be a new domain in my test blueprint.

The most common form of item review is generation of a content validity index (CVI), which typically is done by presenting proposed items (and sometimes the test blueprint) to scholars who are experts in the concept, asking them to rate the appropriateness of each item on a scale of 1 to 4, and averaging the results for each item and overall. This is a top-down approach because we are asking experts rather than those experiencing the phenomenon. (We also receive reports in which experts have been asked to endorse an instrument's “face validity.” A thoughtful content validity evaluation serves the same purpose much more systematically and rigorously.) I rarely see low CVIs. This may be all to the good, but sometimes I wonder: does this mean the raters simply endorsed past conceptualizations of the content, and the instrument has the same blind spots or misrepresentations as past tools? Does it mean that items with lower averages were deleted? If items with low scores were deleted, were important parts of the concept lost?
To alleviate these worries for reviewers and readers, authors should report how each proposed item fared in this evaluation and how a CVI was calculated. In addition to a thoughtfully calculated CVI for the whole measure (S-CVI), it is important to present the item-level CVI (I-CVI) for each item (Polit, Beck, & Owen, 2007). To further strengthen the top-down content validation process, instead of or in addition to research-focused content experts, clinicians with frequent exposure to the concept of interest can be asked to complete a second set of CVI ratings to rate the relevance of the items to the target population, as well as the appropriateness of the literacy level and vocabulary, and to identify aspects or domains not tapped by the drafted tool. For my discouragement example (which by now may be discouraging to readers), I could recruit clinicians working daily with chronic illness patients and ask them to complete the CVI. My preference would be to recruit clinicians who are not regularly exposed to published theories, in the hope that they will not only rate the relevance and user-friendliness of the items to the target population but will bring up aspects of the concept not described elsewhere, as they consider whether all the relevant characteristics of discouragement have been covered.

Once an instrument is drafted, a great deal can be learned from a systematic assessment of its relevance and clarity to the target population. The response process of the target population to the instrument is part of overall validity (Cook & Beckman, 2006). Cognitive interviewing or think-aloud methods (e.g., Willis, 2004) elicit qualitative feedback from some of the target population while they are completing the instrument, commenting on its clarity and centrality to the measurement goal, and the clarity and appropriateness of the response options.
Participants can be asked whether aspects of the concept being measured are missing, whether the vocabulary used to depict the experience under study is familiar or hard to identify with, and for any other reactions. With a sufficiently diverse sample of volunteers, the range of likely responses can be seen and adjustments made to response options if necessary. I am eager to receive more instrument development reports that include cognitive interviewing.

For my discouragement instrument, the patients I interviewed earlier or a different group could be invited to think aloud while completing the new instrument either on paper or on a computer or tablet. I could ask questions as they go through the tool and audio-record their reactions. Perhaps they see a need for a response option that was not in the original plan, such as a middle-range choice to show they neither agree nor disagree with a statement or a not-applicable choice if they find an item irrelevant. Some domains might not apply to some participants. For example, the domain about significant others’ reactions to discouragement might be described as irrelevant by participants who live alone or are not partnered, but they may remark that health care providers’ reactions to the patient's behavior or to health indicators or test results are central to their experience. This comment may lead me to reword the section on perceptions of reactions of others to make it more general and applicable to a variety of others who have important influences on the respondent.

Often, instruments are developed in order to measure a variable for a research project. When the instrument development process is preliminary to a larger study, it may seem tedious and unnecessary to conduct the steps above before administering it to a large sample.
Nonetheless, what can be learned in these preliminary steps about the concept being measured and the best way to measure it should make the instrument much more user-friendly, accurate, and meaningful. The process also provides a valuable introduction to the clinical world under study and the recruitment and participation preferences of the target population. Most importantly for me as an editor, when a manuscript includes a description of these steps, I will be reassured that an effort was made to avoid simply recreating a researcher's assumptions about the concept or assumptions about how participants will understand and respond to its measurement.

Once the instrument draft has been through the steps described above, it can be administered to a large and diverse sample and its internal structure and other characteristics can be explored. However, factor analysis and internal consistency reliability are insufficient to demonstrate the adequacy of a measurement tool. In the next editorial, I will focus on further validation steps that make an instrument research-ready. Absence of these steps is the most common reason for rejection of instrument development papers before review.
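
For readers who want to see the arithmetic behind the item review discussed earlier, the following is a minimal sketch rather than a prescribed method. It assumes one widely used convention in which an item's I-CVI is the proportion of experts rating it 3 or 4 on the 1-to-4 scale, and the scale-level value is the average of the I-CVIs (often written S-CVI/Ave); a given team may have calculated its CVI differently, which is exactly why the calculation should be reported. The item names, domain labels, ratings, and the 0.78 cutoff are all hypothetical and chosen only for illustration. After low-rated items are dropped, the sketch checks the remaining items against the test blueprint to see whether any domain has been lost.

```python
from collections import Counter

# Hypothetical data: item -> (blueprint domain, expert ratings on a 1-4 scale).
ratings = {
    "sad_most_days": ("emotions", [4, 4, 3, 4, 3]),
    "lost_interest": ("emotions", [3, 4, 4, 3, 4]),
    "cannot_change_situation": ("perceptions", [4, 3, 4, 4, 2]),
    "nothing_will_improve": ("perceptions", [2, 3, 2, 4, 2]),
    "others_seem_disappointed": ("reactions_of_others", [4, 2, 2, 3, 2]),
}

CUTOFF = 0.78  # assumed retention threshold, for illustration only


def item_cvi(scores, relevant_from=3):
    """Proportion of experts who rated the item at or above `relevant_from`."""
    return sum(score >= relevant_from for score in scores) / len(scores)


def domain_counts(items):
    """Number of items per blueprint domain among the given item names."""
    return Counter(ratings[item][0] for item in items)


# Item-level CVI for every proposed item.
i_cvis = {item: item_cvi(scores) for item, (_, scores) in ratings.items()}

# Scale-level CVI as the average of the item-level values (S-CVI/Ave).
s_cvi = sum(i_cvis.values()) / len(i_cvis)

# Items that would survive the assumed cutoff.
retained = [item for item, value in i_cvis.items() if value >= CUTOFF]

# Compare blueprint coverage before and after deletion.
before = domain_counts(ratings)
after = domain_counts(retained)
lost_domains = sorted(domain for domain in before if after[domain] == 0)

for item, value in i_cvis.items():
    print(f"{item}: I-CVI = {value:.2f}")
print(f"S-CVI/Ave = {s_cvi:.2f}")
print("Domains left without items:", lost_domains)
```

Reporting the per-item values alongside the scale-level figure, together with the blueprint counts before and after deletion, lets a reviewer see directly whether a high overall CVI was achieved by discarding an entire domain.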

Authors

Margaret H. Kearney

University College Cork

Journals

Research in Nursing & Health

Institutions

University of Rochester
