A multimodal vision–language framework using contrastive language-image pre-training for robust facial expression analysis in realistic classroom environments | Synapse