Abstract

Thyroid disease affects approximately 20 million Americans and poses significant challenges for early diagnosis. In this study, we conduct a comparative analysis of machine learning and deep learning models on a recently curated, well-preprocessed tabular dataset of 9172 observations. The models include logistic regression, decision trees, random forest, support vector machines, XGBoost, LightGBM, artificial neural networks (ANN), and dense neural networks (DNN). Our results show that ensemble methods, particularly stacking and bagging, consistently outperform individual models in F1 score and overall robustness. Stacking models with XGBoost as the meta-learner achieved the highest F1 score, 0.9944. We further demonstrate that addressing class imbalance through undersampling and label restructuring substantially improves model performance across multiple settings. These findings highlight the importance of ensemble techniques and careful data preprocessing in medical classification tasks and offer updated performance benchmarks for thyroid disease prediction.
Zhong et al. (Fri,) studied this question.
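To make two of the terms in the abstract concrete, the following is a minimal sketch of random undersampling and of the F1 score, written in plain Python. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify how undersampling was performed or how F1 was averaged across classes, so the helper names `random_undersample` and `f1_score`, the choice to downsample every class to the size of the rarest one, the binary (single positive class) F1, and the toy data are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly downsample every class to the size of the rarest class.

    Assumption: this is one common form of undersampling; the study's
    exact procedure is not described in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Size of the smallest class sets the per-class sample count.
    target = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        for row in rng.sample(members, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: define F1 as 0 to avoid division by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up data: 10 positive cases among 100 observations.
rows = [[i] for i in range(100)]
labels = [1 if i < 10 else 0 for i in range(100)]

bal_rows, bal_labels = random_undersample(rows, labels)
print(Counter(bal_labels))  # both classes now have 10 examples

# A model that always predicts the majority class is 90% accurate here,
# yet its F1 for the positive class is zero.
majority_f1 = f1_score(labels, [0] * len(labels))
print(majority_f1)
```

The last two lines show why F1 rather than accuracy is the natural headline metric on imbalanced medical data: a classifier that never flags the minority class looks accurate but has no diagnostic value.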