Social-media platforms provide abundant signals related to mood disorders, yet building reliable supervised models is hindered by scarce expert annotations and heterogeneous, noisy language. This paper introduces a two-stage framework for mood-state classification (mania, depression, normal) that leverages large-scale unlabeled posts while preserving evaluation rigor on a strictly held-out clinician-labeled benchmark (G^500ₓ₄ₒₓ). In Stage 1, we generate pseudo-labels using a Flan-T5 self-consistency scheme that samples multiple label proposals per post, aggregates them by majority vote, and retains only high-agreement instances. This yields markedly cleaner supervision, reaching 0.870 accuracy and 0.863 macro-F1 on G^500ₓ₄ₒₓ, an improvement over the strongest labeling baseline (0.538 accuracy and 0.446 macro-F1) of +0.332 and +0.417 absolute points (+61.7% and +93.5% relative, respectively). Importantly, worst-class robustness (Min-F1) increases from 0.165 to 0.830 (+0.665 absolute; 5.03×, i.e., +403%); the large relative gain reflects the low baseline Min-F1. In Stage 2, we cast model selection as a multi-objective optimization problem that jointly maximizes macro-F1 and worst-class F1 while minimizing inference latency, and solve it using Bayesian optimization with qEHVI (via BoTorch). The optimized configurations yield +4.9% macro-F1 and +7.3% Min-F1 with a 33% latency reduction relative to an untuned baseline (0.803 macro-F1, 0.772 Min-F1, latency 138.6), providing a practical accuracy–efficiency trade-off. To quantify uncertainty and confirm that the observed improvements are statistically supported, we perform paired significance analyses on G^500ₓ₄ₒₓ and report 95% bootstrap confidence intervals. Extensive experiments reveal Pareto-optimal solutions suitable for deployment under resource constraints and show consistent improvements across evaluation metrics.
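The Stage 1 aggregation step can be illustrated with a minimal sketch. It assumes the label proposals for each post have already been sampled from the model; the function name `aggregate_votes`, the agreement threshold of 0.8, the five samples per post, and the toy post identifiers are all hypothetical choices for illustration, not values reported in this work.

```python
from collections import Counter


def aggregate_votes(samples, min_agreement=0.8):
    """Majority-vote over sampled label proposals for one post.

    Returns (label, agreement) when the most frequent label reaches
    `min_agreement`, otherwise None (the post is discarded).
    The 0.8 default is an illustrative assumption.
    """
    if not samples:
        return None
    label, top_count = Counter(samples).most_common(1)[0]
    agreement = top_count / len(samples)
    if agreement >= min_agreement:
        return label, agreement
    return None


# Toy proposals: in practice each list would hold labels sampled
# from the language model for the same post (sampling not shown).
proposals = {
    "post_1": ["depression"] * 5,
    "post_2": ["mania", "mania", "normal", "mania", "mania"],
    "post_3": ["normal", "depression", "mania", "normal", "depression"],
}

# Keep only posts whose majority label is sufficiently consistent.
pseudo_labels = {}
for post_id, samples in proposals.items():
    result = aggregate_votes(samples)
    if result is not None:
        pseudo_labels[post_id] = result

print(pseudo_labels)
```

In this toy run, `post_3` is dropped because no label reaches the threshold, which is the filtering behavior that "retain high-agreement instances" describes. A stricter threshold trades pseudo-label coverage for cleaner supervision.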
Issam Zidi (Tue,) studied this question.