BACKGROUND: Early prediction of depressive and anxiety disorders is challenging due to substantial heterogeneity in risk pathways. Conventional machine-learning models trained on aggregated populations may obscure subgroup-specific mechanisms and limit interpretability for prevention. We evaluated whether a hybrid unsupervised-supervised framework can identify meaningful subgroups and yield more interpretable risk prediction.
METHODS: We analyzed cohort data of 15,897 Japanese adults who completed baseline (August-September 2020) and 6-month follow-up (February-March 2021) surveys and did not screen positive for depressive and anxiety disorders at baseline (K6 score < 13). Using 169 baseline demographic, psychosocial, lifestyle, and behavioral variables, we performed hierarchical clustering to derive data-driven subgroups. Within each cluster, we trained Random Forest models to predict incident screened depressive and anxiety disorders at follow-up (K6 ≥ 13) and interpreted predictors using SHapley Additive exPlanations (SHAP).
RESULTS: The overall 6-month incidence was 6.23%. A five-cluster solution revealed two high-risk subgroups: an older-adult profile with poor quality of life (12.9%) and a working-parent profile characterized by work-family overload (29.8%). Compared with a global model trained on the full sample, the cluster-then-predict framework showed broadly similar overall performance but performed better in the highest-risk subgroup and revealed more differentiated predictor profiles. Loneliness, health-related quality of life, happiness, and personality traits predominated in clusters with moderate adversity, whereas lifestyle disruption (sleep, diet, and irregular routines) characterized the high-risk late-life subgroup and alcohol dependence and work-family burden characterized the high-risk working-parent subgroup.
CONCLUSIONS: Addressing risk-factor heterogeneity before prediction may enable more interpretable, context-tailored prevention strategies.
Chen et al. (Thu,) studied this question.
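The cluster-then-predict idea described in METHODS can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, the function names (`make_synthetic_cohort`, `cluster_then_predict`), the Ward linkage, the two-cluster setting, and all hyperparameters are assumptions made for illustration, and the Random Forest's impurity-based feature importance is used as a simple stand-in for SHAP.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def make_synthetic_cohort(n=600, n_features=8, seed=0):
    """Hypothetical stand-in data: two latent subgroups with different risk drivers."""
    rng = np.random.default_rng(seed)
    group = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, n_features))
    X[:, 0] += 6.0 * group  # feature 0 separates the two subgroups
    # The outcome depends on feature 1 in one subgroup and feature 2 in the other.
    logit = np.where(group == 0, 1.5 * X[:, 1], 1.5 * X[:, 2]) - 1.0
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    return X, y


def cluster_then_predict(X, y, n_clusters=2, seed=0):
    """Cluster on baseline features, then fit one Random Forest per cluster."""
    # Step 1: unsupervised subgrouping (hierarchical clustering; Ward linkage assumed).
    labels = AgglomerativeClustering(
        n_clusters=n_clusters, linkage="ward"
    ).fit_predict(X)

    results = {}
    for c in np.unique(labels):
        Xc, yc = X[labels == c], y[labels == c]
        # Step 2: supervised prediction within the cluster, with a held-out split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            Xc, yc, test_size=0.3, stratify=yc, random_state=seed
        )
        model = RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=seed
        )
        model.fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        results[int(c)] = {
            "n": int(len(yc)),
            "incidence": float(yc.mean()),
            "auc": float(auc),
            # Stand-in for SHAP: index of the most important feature in this cluster.
            "top_feature": int(np.argmax(model.feature_importances_)),
        }
    return labels, results


if __name__ == "__main__":
    X, y = make_synthetic_cohort()
    _, results = cluster_then_predict(X, y)
    for cluster_id, summary in sorted(results.items()):
        print(cluster_id, summary)
```

Fitting a separate model per cluster lets each subgroup surface its own dominant predictors, which is the interpretability argument the abstract makes; a global model fitted on the pooled sample would be the natural comparison.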