This study compares the performance of the Multidimensional Item Response Theory (MIRT), Higher-Order IRT (HO-IRT), and Bifactor models for the simultaneous estimation of total and subscale scores in multidimensional tests. Using both simulated data and real data from an English proficiency exam, model performance was evaluated in terms of accuracy (RMSE), reliability, and classification accuracy. The simulation included 5,000 respondents, 120 items, and a four-dimensional structure, manipulating item format, test difficulty, and inter-dimensional correlation. Results indicated that MIRT consistently outperformed the other models, yielding the lowest RMSE and highest reliability and classification accuracy across conditions. HO-IRT also showed strong performance, while the Bifactor model underperformed, particularly in subscore estimation. Model performance was sensitive to test characteristics and dimensional relationships. Findings from the real data analysis supported the simulation results, underscoring the value of multidimensional modeling for diagnostic feedback and informed decision-making.
ERDEMİR et al. studied this question.
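The three evaluation criteria named in the abstract (RMSE, reliability, and classification accuracy) can be illustrated with a minimal sketch. This is not the authors' code. It assumes that reliability is taken as the squared correlation between true and estimated scores and that classification uses a single cut score; the abstract specifies neither. The inter-dimensional correlation `rho`, the noise standard deviation `noise_sd`, and the cut score of 0 are hypothetical values chosen only for illustration, and the "estimates" here are true scores plus random noise rather than output from a fitted MIRT, HO-IRT, or Bifactor model.

```python
import numpy as np


def rmse(true_scores, est_scores):
    """Root mean squared error per dimension (column)."""
    t = np.asarray(true_scores, dtype=float)
    e = np.asarray(est_scores, dtype=float)
    return np.sqrt(np.mean((e - t) ** 2, axis=0))


def reliability(true_scores, est_scores):
    """Squared true/estimate correlation per dimension.

    Assumption: the abstract does not say how reliability was defined.
    """
    t = np.asarray(true_scores, dtype=float)
    e = np.asarray(est_scores, dtype=float)
    return np.array(
        [np.corrcoef(t[:, d], e[:, d])[0, 1] ** 2 for d in range(t.shape[1])]
    )


def classification_accuracy(true_scores, est_scores, cut=0.0):
    """Share of respondents placed on the same side of `cut` by both scores."""
    t = np.asarray(true_scores, dtype=float) >= cut
    e = np.asarray(est_scores, dtype=float) >= cut
    return np.mean(t == e, axis=0)


# Toy data matching the abstract's sample size and dimensionality.
rng = np.random.default_rng(0)
n_respondents, n_dims = 5000, 4
rho = 0.5        # hypothetical inter-dimensional correlation
noise_sd = 0.4   # hypothetical estimation error

cov = np.full((n_dims, n_dims), rho)
np.fill_diagonal(cov, 1.0)
theta = rng.multivariate_normal(np.zeros(n_dims), cov, size=n_respondents)

# Stand-in for model-based subscale estimates: truth plus random noise.
theta_hat = theta + rng.normal(0.0, noise_sd, size=theta.shape)

print("RMSE per subscale:        ", rmse(theta, theta_hat))
print("Reliability per subscale: ", reliability(theta, theta_hat))
print("Classification accuracy:  ", classification_accuracy(theta, theta_hat))
```

In a model comparison, `theta_hat` would be replaced by the subscale estimates from each fitted model, and the same three functions would be applied to each set of estimates under every simulated condition.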