Purpose: In 2024, the University of the West Indies transitioned from discipline-specific final examinations to a unified medical exit examination. This study evaluates the feasibility and psychometric performance of this unified format, focusing on written item discrimination and the comparability of multiple Objective Structured Clinical Examination (OSCE) circuits.
Methods: A retrospective analysis was conducted of de-identified results from all candidates who sat the unified examination at the St Augustine Campus in May/June 2025. The assessment comprised a 320-item single best answer paper and a 17-station OSCE delivered concurrently across seven circuits. Inter-circuit differences were tested with one-way analysis of variance (ANOVA). Reliability was estimated using Cronbach’s alpha and Generalizability Theory (G- and phi coefficients). Decision-study modelling estimated the number of OSCE stations required for high-stakes reliability. Pearson’s correlation assessed the relationship between written and OSCE performance.
Results: Scores from 157 candidates were analysed. Of the 320 single best answer items, 163 (50.9%) demonstrated acceptable discrimination (point-biserial correlation coefficient [PBSC] ≥ 0.20) and 26 (8.1%) showed negative discrimination, indicating the need for post-examination item review. Although 16 of 18 OSCE stations showed statistically significant inter-circuit differences, these differences were substantially attenuated upon aggregation: total OSCE scores differed only slightly, though still statistically significantly, between circuits. Overall OSCE reliability was moderate (Cronbach’s alpha 0.72; G-coefficient 0.72; phi coefficient 0.69). Decision-study modelling indicated that approximately 20 stations would be required to achieve reliability suitable for high-stakes decisions. Written and OSCE scores correlated positively (r = 0.70, p < 0.001).
Conclusions: A unified final exit examination is feasible and psychometrically defensible in large cohorts, but requires adequate OSCE station sampling to support high-stakes decisions.
Maharaj et al. (Wed,) studied this question.
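For readers unfamiliar with the statistics named in the Methods, the following is a minimal Python sketch of three of them: Cronbach's alpha, the point-biserial discrimination index, and a Spearman-Brown projection of reliability to a different number of stations. It is not the authors' analysis code. The function names are hypothetical, the data are synthetic, and the Spearman-Brown formula is a classical-test-theory simplification that stands in for, and will not necessarily reproduce, the Generalizability Theory decision study reported in the abstract.

```python
import numpy as np


def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (candidates x items/stations) score matrix."""
    k = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()   # sum of per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1 - sum_item_var / total_var)


def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """Point-biserial index: Pearson correlation of a 0/1 item with total score.

    Assumption: the uncorrected form is shown (the item is not removed from
    the total); the abstract does not say which form the authors used.
    """
    return float(np.corrcoef(item, total)[0, 1])


def spearman_brown(reliability: float, n_current: int, n_new: int) -> float:
    """Projected reliability if test length changes from n_current to n_new."""
    factor = n_new / n_current
    return factor * reliability / (1 + (factor - 1) * reliability)


if __name__ == "__main__":
    # Synthetic illustration only: 100 hypothetical candidates, 10 stations.
    rng = np.random.default_rng(0)
    ability = rng.normal(size=(100, 1))
    stations = ability + 1.5 * rng.normal(size=(100, 10))
    item = (ability[:, 0] + rng.normal(size=100) > 0).astype(int)

    alpha = cronbach_alpha(stations)
    print("alpha:", round(alpha, 3))
    print("point-biserial:", round(point_biserial(item, stations.sum(axis=1)), 3))
    print("projected alpha at 20 stations:", round(spearman_brown(alpha, 10, 20), 3))
```

In a sketch like this, an item would be flagged for review when its point-biserial value falls below the 0.20 threshold cited in the Results, or is negative.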