How specific is case specificity?


Geoffrey Norman,1 Georges Bordage,2 Gordon Page3 & David Keane1

OBJECTIVES Case specificity implies that success on any case is specific to that case. In examining the sources of error variance in performance on case-based examinations, how much error variance results from differences between cases compared with differences between items within cases? What is the optimal number of cases and questions within cases to maximise test reliability given some fixed period of examination time?

METHODS G and D generalisability studies were conducted to identify variance components and reliability for each examination analysed, and to optimise the reliability of the given test composition (1, 1.5, 2, 3, 4 and 5 questions per case), using data from 3 key features examinations of the Medical Council of Canada (n = 6342 graduating medical students), each of which consisted of about 35 written cases followed by 1–4 questions regarding specific key elements of data gathering, diagnosis and/or management.

RESULTS The smallest variance component was due to subjects; the variance due to subject–item interaction was over 5 times the interaction with cases (on average, 0.1106 compared with 0.0195). Relatively little variance was due to differences between cases; about 80% of the error variance was due to variability in performance among items within cases. The D study showed that reliability varied between 0.541 and 0.579, was least with 1 item per case and highest at 2 and 3 items per case.

CONCLUSIONS The main source of error variance was items within cases, not cases, and the optimal strategy in terms of enhancing reliability would use cases with 2–3 items per case.

KEYWORDS *education, medical, undergraduate; educational measurement/*methods/standards; Ontario; sensitivity and specificity.

Medical Education 2006; 40: 618–623

1 Health Sciences Centre, McMaster University, Hamilton, Ontario, Canada
2 College of Medicine, University of Illinois at Chicago, Chicago, Illinois, USA
3 Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada

Correspondence : Geoffrey Norman PhD, Room 2C14, Health Sciences Centre, McMaster University, 1200 Main Street West, Hamilton, Ontario L8N 3Z5, Canada. Tel: 00 1 905 525 9140 ext. 22119; Fax: 00 1 905 577 0017; E-mail: [email protected]


doi:10.1111/j.1365-2929.2006.02511.x

INTRODUCTION

Thirty years ago, expert clinical reasoning was thought to be a matter of the acquisition of 'problem-solving skills' that expert clinicians possessed and students sought to acquire. Standardised evaluation methods, like patient management problems,1 evolved to measure these skills and were implemented in national licensing examinations. The idea seemed reasonable enough. Most of us would accept without proof a conjecture that expert cardiologists are better able to solve heart problems than, say, endocrinologists or paediatricians. Ample evidence exists that Tiger Woods plays better golf than most weekend golfers and Kasparov better chess than most. It does not appear to be a great leap of faith to assume that their superiority lies in their greater problem-solving skills.

Yet, studies conducted in medicine and other domains in the late 1970s2–4 showed that successful solution of 1 problem was a poor predictor of whether an individual would successfully solve another problem. Typically, correlations across problems range from 0.1 to 0.3, regardless of the representation of the problem, the expertise of the problem solver, or the measure of success. The phenomenon was called 'case specificity' by Elstein et al.2 Since then, the finding has also been referred to as 'context specificity'4 and 'content specificity'.5



Overview

What is already known on this subject

Case specificity assumes that error variance due to cases should be high and error variance due to items within cases, low.

What this study adds

The present findings contradict that assumption. The performance across cases was poorly correlated, with relatively little variance due to differences between cases, and about 80% of the error variance due to variability in performance among items within cases. Our traditional assumption of case specificity is simplistic. The main source of variability is not due to cases but to the specific items nested in the cases.

Suggestions for further research

Similar results are likely to be found for examinations containing caselets, with questions nested in cases, and with objective structured clinical examinations, where checklist items are nested in cases.

None of these labels are necessarily at variance with our common notion of expertise; they just mean that, to identify expertise, we will have to sample situations more broadly. Tiger Woods has bad days; a 0.333 baseball batter still fails to get a hit 2 times out of 3; Kasparov loses chess games to inferior players (occasionally). It does suggest that success in solving a problem is not simply a matter of application of a general skill to a particular situation.

However, these terms are not interchangeable. Content specificity implies that the variability across problems is a consequence of the variable content knowledge of the clinician, with the additional assumption that such knowledge is clustered within cases. Thus, someone who knows about myasthenia gravis will do well on all questions related to myasthenia gravis, but this may have a poor relationship with his or her performance on a different disease. There is some evidence consistent with this view.

For example, studies of the relationship between performance on a certification examination and management of acute myocardial infarction have shown that successful certificants have lower patient mortality rates than unsuccessful candidates, and subspecialists have lower patient mortality rates than internists.6 However, it could be that some other factor, as general as intelligence or as specific as individual patient experiences, 'causes' both high board certification scores and better patient management skills.

There is some experimental evidence to substantiate the role of knowledge. In 1985, Norman et al. had residents work through a series of simulated patient problems where content was systematically varied, from 2 presentations of the same problem to problems in a different system.7 Correlations did drop monotonically as content became more disparate; still, it was striking that, even when the same case was presented on 2 occasions, the mean correlation of various measures was only 0.29 (range 0.07–0.60). This suggests that something other than content is contributing to the variability.

One possibility is context. The literature in psychology suggests that recall of knowledge can be strongly influenced by the match between context at learning and context at retrieval. However, context is itself a vague term, encompassing all those circumstantial factors that should not influence knowledge retrieval, but do. One form of context specificity in medical problem solving has been identified; clinicians have been shown to be influenced by similarity to past patients, even on features that are objectively irrelevant to the diagnosis.8 Nevertheless, whatever influence contextual factors do have, it should presumably act at the level of the whole case.

So what of case specificity? This appears simply to restate the original finding in words – success on any case is specific to that case. In contrast to content specificity, there is no implied hypothesis about the reason for the observation. But there is an implication that the low correlation across cases reflects the fact that individual clinicians may have varying levels of content mastery in different cases. So retrieval of any knowledge within a case should be highly correlated with retrieval of other knowledge within the same case. Consequently, if we analysed a test containing modified essay questions9 or key features cases10 comprising a number of cases, each with several nested questions, we should find that the correlations among questions within a case will be high, but between cases, low.



Or, putting the finding in the language of reliability and generalisability theory, the error variance due to cases should be high and the variance due to questions within cases, low.

The issue is of practical as well as theoretical interest. If the main source of error variance is cases, then test reliability can be increased by increasing the number of cases. Of course, if the number of questions per case remains fixed, this will also increase the total number of questions. However, within a fixed total examination time, an increase in the number of cases is likely associated with a reduction in the number of items overall, as additional cases require the allotting of additional time for reading the case stem. If increasing the number of cases does reduce the total number of questions in the examination and error is actually a consequence of questions rather than cases, the paradoxical result may emerge that an increase in the number of cases could reduce, rather than increase, test reliability.

To examine the error variance attributable to cases and questions (hereafter items), we analysed the performance of candidates who sat the Medical Council of Canada (MCC) Qualifying Examination Part 1 over a 3-year period, comprising a total of approximately 6000 candidates. We focused on the key features component of the examination, which consisted of 28–39 cases, each with 1–4 individual questions. Analysis was conducted to estimate variance components due to candidates (subjects), cases and items within cases. Two research questions were addressed.

1 In examining the sources of error variance in performance on key features examinations, how much error variance results from differences between cases compared with differences between items within cases?
2 Based on the relative magnitude of error variance due to the 2 sources, what is the optimal number of cases and items within cases to maximise test reliability within some fixed period of examination time?

METHODS

We obtained an anonymised database from the MCC containing details of the performances of 6342 candidates over 3 examination years (1997, 1998 and 1999) on the key features component of their qualifying examination for graduating medical students (also referred to as the clinical reasoning skills component). The key features component has been described in detail elsewhere.10,11 Basically, it consists of approximately 35 cases that are selected according to a test blueprint that reflects the proportion of patients seeking health services in 5 age groups (i.e. neonatal/pregnancy, children, adolescent, adult and elderly). Each case is comprised of a written case description of 2–3 paragraphs, followed by 1–4 questions regarding specific critical or essential elements (the key features) of data gathering, diagnosis and/or management. The questions are asked in 'short menu' (from about 5–15 options) or 'short answer' format. Scores at the individual question level are converted to a 0–1 scale (using partial credit scoring) and then averaged to give a case score. The average score across all cases is used in judgements of acceptable performance. The pass/fail standard for the key features component is set using a modified Angoff method.

The database consisted of scores at the item level for each candidate, comprising a total of approximately 60 repeated observations per candidate. These responses were analysed with a general variance components program (urGENOVA12). The design variables were subject (S; candidate), case (C) and item (I; question) within case, with the additional interaction terms S × C and S × I : C. Of particular interest were the 2 interaction terms, which capture the relative amount of error variance associated with case and item. The number of levels associated with each component, for each year of the examination, is shown in Table 1.

From the variance components, we then computed the overall reliability of each test. This required a consideration of the number of cases (N_c) containing 1, 2, 3 or 4 items (shown in Table 1). The error variance in the intraclass correlation consisted of the sum of a component due to case variance:

\sigma^2(S \times C)/N_c \qquad (1)

and a second component due to items, which is weighted by the numbers n_1, n_2, n_3 and n_4 of 1-, 2-, 3- and 4-item cases:

\sigma^2(S \times I)\,[\,n_1 + n_2/2 + n_3/3 + n_4/4\,]/N_c^2 \qquad (2)

reflecting the fact that, for an n-item case, the error variance due to items is divided by the number of items in the case. The test reliability, then, is:

R = \frac{\sigma^2(S)}{\sigma^2(S) + \sigma^2(S \times C)/N_c + \sigma^2(S \times I)\,[\,n_1 + n_2/2 + n_3/3 + n_4/4\,]/N_c^2} \qquad (3)
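To make the use of equations 1–3 concrete, the short Python sketch below recomputes the 2 error components and the resulting G coefficient for the 1997 examination, using the rounded case mix and variance components reported later in Tables 1 and 2. This is a minimal illustration, not the MCC's analysis code; the variable names are ours, and the results match Table 2 only to the rounding shown there.

```python
# Sketch of equations 1-3 for the 1997 examination (rounded inputs from
# Tables 1 and 2; variable names are ours).

var_s = 0.0050    # sigma^2(S), subject variance, 1997
var_sc = 0.0133   # sigma^2(S x C), subject-by-case interaction
var_si = 0.1163   # sigma^2(S x I:C), subject-by-item-within-case interaction

cases_by_k = {1: 7, 2: 13, 3: 8, 4: 0}   # number of cases with 1-4 questions (Table 1, 1997)
n_cases = sum(cases_by_k.values())        # 28 cases in 1997

# Equation 1: error component due to cases
error_case = var_sc / n_cases

# Equation 2: error component due to items, weighted by items per case
error_item = var_si / n_cases**2 * sum(n / k for k, n in cases_by_k.items())

# Equation 3: test reliability (G coefficient)
g = var_s / (var_s + error_case + error_item)

print(f"error (case) = {error_case:.4f}")   # ~0.0005
print(f"error (item) = {error_item:.4f}")   # ~0.0024
print(f"G coefficient = {g:.3f}")           # ~0.635
```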



Finally, we conducted a D study, examining the relationship between number of cases and questions and the test reliability. We assumed, based on data from past administrations and pilot studies, that it would take a candidate 2 minutes to read each stem and 2 minutes to answer each test item. To constrain the problem further, we assumed that the test contained a fixed number of questions for each case. Thus, for example, with 1 question per case and 2 hours of testing time, the test would comprise 30 cases and 30 items (2 mins/stem × 30 + 2 mins/item × 30 = 120 mins). For 3 questions per case, the test would comprise 15 cases and 45 questions (15 × 2 + 45 × 2 = 120). The test reliability was computed for 1, 1.5, 2, 3, 4 and 5 items per case. The formula is a simplification of equation 3 that involves only a single multiplier of the S × I interaction corresponding to the number of items per case in the whole test.
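As a rough check on this design, the sketch below recomputes the D study under the stated time budget (2 minutes per stem, 2 minutes per item, 120 minutes in total), using the mean variance components from Table 2 and our reading of the simplified form of equation 3 for a fixed number of items per case. Because the published mean components are rounded, the coefficients agree with Table 3 only to within roughly 0.005.

```python
# D study sketch: reliability as a function of items per case under a fixed
# 120-minute budget (2 min per case stem, 2 min per item). Uses the mean
# variance components from Table 2; assumes every case carries the same
# number of items, so the item error term becomes var_si / (n_cases * k).

var_s, var_sc, var_si = 0.0051, 0.0195, 0.1106   # mean components (Table 2)
stem_time, item_time, total_time = 2, 2, 120      # minutes

for k in (1, 1.5, 2, 3, 4, 5):                    # items per case
    n_cases = int(total_time / (stem_time + item_time * k))
    n_items = int(n_cases * k)
    rel = var_s / (var_s + var_sc / n_cases + var_si / (n_cases * k))
    print(f"{k:>3} items/case: {n_cases} cases, {n_items} items, G ~ {rel:.3f}")
```

With these rounded inputs the sweep reproduces the pattern in Table 3: reliability is lowest with 1 item per case and peaks at 2–3 items per case.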

Table 1  Characteristics of the 3 key features examinations

Year   Candidates (n)   Cases (n)   Cases with 1 Q   2 Q   3 Q   4 Q
1997   2060             28           7               13    8     0
1998   2121             31          15                8    7     1
1999   2161             39          18               16    5     0

Q = question(s)

RESULTS

The critical issue is the relative magnitude of the subject variance and the 2 error terms, S × C and S × I : C. In each of the years analysed, the smallest variance component was due to subjects (Table 2). Critical to the issue of content specificity, in each year the variance due to subject interaction with items was over 5 times the interaction with cases (on average, 0.1106 compared with 0.0195). Thus, the main source of error variance was items, not cases.

It could be argued that this contrast is not strictly accurate, as each case contains several items, so that the contribution of items to the unreliability will be divided by the number of items, as shown in the equations above. To permit a direct comparison, we computed the 2 error components in equations 1 and 2; these are shown in the middle of Table 2. It remains that items contribute substantially more error than cases: approximately 80% of the error variance comes from items, and 20% from cases.

As the results so far suggest that case variance is a small contributor to score error, and item variance a large contributor, we could consider treating items as if they were not, in fact, clustered within cases. This would reflect a scoring approach in which the scores on each item (0 or 1) were simply totalled, like items on a multiple-choice test, instead of averaged within case to create a case score and then totalled. In terms of the form of the G coefficient, the error portion of the denominator, instead of containing both the S × C and S × I : C interactions as in equation 3, would contain only a single pooled S × I interaction, and the G coefficient would be:

R = \frac{\sigma^2(S)}{\sigma^2(S) + \sigma^2(S \times I)/(n_1 + 2n_2 + 3n_3 + 4n_4)} \qquad (4)

where σ²(S × I) in this analysis equals the sum of σ²(S × C) + σ²(S × I : C) in the former analysis, and the divisor n_1 + 2n_2 + 3n_3 + 4n_4 is the total number of items.

Comparing equation 4 with equation 3, the only difference is that the interaction term is divided by the number of cases in equation 3 and by the number of items in equation 4. Thus, the net effect would be to actually increase the test reliability. The actual calculations are shown in the last row of Table 2 and indicate an average increase in reliability of 0.065 when items are treated independently of cases, a difference of about 10%.
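For the items-as-independent comparison, a similarly minimal sketch (variable names are ours; inputs are the rounded 1997 figures from Tables 1 and 2) applies equation 4, pooling the S × C and S × I : C variances and dividing by the total number of items (57 in 1997), and reproduces the 'items only' coefficient of about 0.687 reported in Table 2, against 0.635 for the case-nested coefficient.

```python
# Items-as-independent comparison (equation 4), 1997 values: a sketch with
# rounded inputs. The pooled S x I variance (S x C plus S x I:C) is divided
# by the total number of items rather than shared out within cases.

var_s = 0.0050                          # sigma^2(S), 1997
var_si_pooled = 0.0133 + 0.1163         # sigma^2(S x C) + sigma^2(S x I:C)

cases_by_k = {1: 7, 2: 13, 3: 8, 4: 0}  # Table 1, 1997
total_items = sum(k * n for k, n in cases_by_k.items())   # 57 items

g_items_only = var_s / (var_s + var_si_pooled / total_items)
print(f"{total_items} items, G (items only) = {g_items_only:.3f}")   # ~0.687 vs ~0.635 case-nested
```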

Table 2  Variance components for 3 years of MCC key features examinations

Component                     1997     1998     1999     Mean     %
Subject (S)                   0.0050   0.0055   0.0047   0.0051    3
Case (C)                      0.0077   0.0008   0.0165   0.0083    5
Item : case (I : C)           0.0217   0.0322   0.0213   0.0251   15
S × C                         0.0133   0.0289   0.0163   0.0195   11
S × I : C                     0.1163   0.0991   0.1165   0.1106   66
Error (case)                  0.0005   0.0010   0.0004   0.0006   21
Error (item)                  0.0024   0.0024   0.0021   0.0023   79
G coefficient                 0.635    0.622    0.639    0.632
G coefficient (items only)    0.687    0.706    0.697    0.697




Table 3  D study of reliability by number of items and cases

Question(s)/case   Cases (n)   Questions (n)   G coefficient (relative)
1                  30          30              0.541
1.5                24          36              0.569
2                  20          40              0.579
3                  15          45              0.579
4                  12          48              0.569
5                  10          50              0.556

The D study showed that the reliability coefficients varied between 0.541 and 0.579; reliability was lowest with 1 item per case and highest at 2 and 3 items per case (Table 3). Thus, the optimal strategy in terms of enhancing reliability would use cases with 2–3 items per case.

DISCUSSION AND CONCLUSIONS

The present findings are clearly at variance with the hypothesis of case specificity in regard to the sources of error variance associated with cases. That hypothesis predicts that most of the variance will be due to differences among cases and relatively little to the effects of the items within the cases. We found the opposite. Relatively little of the error variance was due to differences between cases, whereas about 80% was due to variability in examinees' performance among items within cases.

This finding has major implications for the notion of case specificity. The low correlation across cases typically observed with many assessment methods13 is not a consequence of knowledge clustered within cases, but reflects the relatively few items treated within each case. While knowledge at the item level does, to some degree, covary with cases, the case remains a small contributor to error variance. Thus, our traditional assumption of case specificity is simplistic. The main source of variability is not due to cases but to the specific items nested in the cases.

There are also practical consequences. If we assume that error variance derives from cases, then the optimal sampling strategy would involve many cases with 1 item per case.

However, we have shown, using plausible assumptions about time distribution and constraints (e.g. a fixed 3-hour testing time), that fewer cases, with an average of 2–3 items per case, actually lead to higher reliability.

This approach also has advantages from both the logistical and acceptability perspectives. Examination preparation, such as in the key feature framework, entails an interaction among selecting cases from a test blueprint, defining appropriate key features, developing suitable cases and writing focused questions. Case development is easier because, once a case is designed, it is relatively easy to come up with several items, as long as the items address critical or essential elements of data gathering, diagnosis and/or management. Further, from the candidates' perspective, once they are involved in working through a case, it is easier to address several related questions, with a sense of having 'completed' the case, than to move rapidly from 1 case to the next. Finally, from a scoring perspective, and given the case and item variance components (see equation 3), the unit of measurement remains the case, with scores at the individual question level averaged to give a case score and case scores totalled to give a test score.
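To illustrate that scoring scheme concretely, the fragment below uses purely hypothetical item scores (the nested list is not MCC data, and the variable names are ours) to show partial-credit item scores being averaged within each case and the case scores then combined into a test score, as described in the Methods.

```python
# Case-based scoring sketch: item scores on a 0-1 scale (partial credit) are
# averaged within each case; case scores are then combined into a test score.
# The nested list below is hypothetical illustration data, not MCC data.

item_scores_by_case = [
    [1.0, 0.5, 1.0],   # case 1: three key-feature questions
    [0.0, 1.0],        # case 2: two questions
    [1.0],             # case 3: one question
]

case_scores = [sum(items) / len(items) for items in item_scores_by_case]
test_score = sum(case_scores) / len(case_scores)   # or total the case scores

print([round(s, 3) for s in case_scores])   # [0.833, 0.5, 1.0]
print(round(test_score, 3))                 # 0.778
```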

Contributors:

GN contributed to study design, data analysis and interpretation, and wrote the first draft of the manuscript. GB contributed to study design, data interpretation, and drafting and critical revision of the manuscript. GP contributed to study design, data interpretation, and critical revision of the manuscript. DK contributed to data analysis and interpretation, and critical revision of the manuscript. All authors approved the final version of the paper.

Acknowledgements: we are grateful to the Medical Council of Canada, in particular André-Phillipe Boulais, Ilona Bartman and Timothy Wood, for providing the anonymised data from the qualifying examination.

Funding: the project was funded in part by a grant from the Medical Council of Canada.

Conflicts of interest: Drs Bordage and Page were the original authors of the key features approach used in this study.

Ethical approval: ethical approval was granted by the Research Ethics Board at McMaster University.

REFERENCES

1 McGuire CH, Babbott D. Simulation technique in the measurement of problem-solving skills. J Educational Measurement 1967;4:1–10.
2 Elstein AS, Shulman LS, Sprafka SA. Medical Problem Solving. Cambridge, Massachusetts: Harvard University Press 1978;292–4.
3 Neufeld VR, Norman GR, Barrows HS, Feightner JW. Clinical problem solving by medical students: a longitudinal and cross-sectional analysis. Med Educ 1981;15:315–22.
4 Perkins DN, Salomon G. Are cognitive skills context-bound? Educ Researcher 1989;18:16–25.
5 van der Vleuten CPM, Swanson DB. Assessing clinical skills with standardised patients: the state of the art. Teach Learn Med 1990;2:58–76.
6 Norcini JJ, Lipner RS, Kimball HR. Certifying examination performance and patient outcomes following acute myocardial infarction. Med Educ 2002;36:853–9.
7 Norman GR, Tugwell P, Feightner JW, Muzzin LJ. Knowledge and clinical problem-solving ability. Med Educ 1985;19:344–56.
8 Hatala R, Norman GR, Brooks LR. Influence of a single example upon subsequent electrocardiogram interpretation. Teach Learn Med 1999;11:110–7.
9 Knox JDE. How to use modified essay questions. Med Teacher 1980;2:20–4.
10 Page G, Bordage G. The Medical Council of Canada's Key Feature Project: a more valid written examination of clinical decision-making skills. Acad Med 1995;70:104–10.
11 Page G, Bordage G, Allen T. Developing key-feature problems and examinations to assess clinical decision-making skills. Acad Med 1995;70:194–201.
12 Brennan RL. urGENOVA. Iowa City: Center for Advanced Studies in Measurement and Assessment, College of Education, University of Iowa. http://www.education.uiowa.edu/casma/computer_programs.htm. [Accessed 5 July 2005.]
13 Eva KW. On the generality of specificity. Med Educ 2003;37:587–8.

Received 14 December 2005; accepted for publication 13 March 2006

© Blackwell Publishing Ltd 2006. MEDICAL EDUCATION 2006; 40: 618–623
