Abstract

Accurate estimation of anthropometric attributes (height, weight, and body mass index, BMI) from facial imagery supports emerging applications in telemedicine, soft biometrics, and large-scale population screening. Existing approaches suffer from limited multimodal fusion, demographic sensitivity, and reduced reliability under unconstrained visual conditions. We present ViT-hCMA, an anthropometric cross-modal Vision Transformer that integrates facial embeddings with demographic cues through a human-centric attention mechanism. The framework introduces three components: (i) a gradient-harmonized multi-task regression loss that balances heterogeneous anthropometric targets, (ii) cross-modal attention that enables physiologically consistent fusion of visual and auxiliary attributes, and (iii) CycleGAN-based augmentation that improves robustness to pose and illumination variability. Evaluation on the VIP Attributes Dataset shows consistent gains over prior methods, with mean absolute error (MAE) reductions of 12% for height, 15% for weight, and 18% for BMI. Interpretability analysis via transformer-based Grad-CAM confirms that the model identifies biomechanically meaningful regions (the jawline and brow for height, and cheek-chin morphology for BMI), with stable behavior across demographic subgroups. Although performance remains weaker in higher BMI ranges, ViT-hCMA offers a scalable and transparent solution for anthropometric prediction when full-body imagery is unavailable.
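For readers unfamiliar with the quantities reported above, the following is a minimal sketch of how BMI, MAE, and a relative MAE reduction are conventionally computed. The function names (`bmi`, `mae`, `relative_reduction`) and all numeric values are illustrative assumptions; they are not taken from the paper's code or results, and the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / (height_m ** 2)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_reduction(baseline_mae: float, model_mae: float) -> float:
    """Percentage by which model_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


if __name__ == "__main__":
    # Hypothetical values, chosen only to illustrate the arithmetic.
    print(bmi(70.0, 1.75))
    print(mae([1.0, 2.0, 3.0], [2.0, 2.0, 5.0]))
    print(relative_reduction(5.0, 4.1))
```

Under this convention, a reported "18% MAE reduction for BMI" would mean the model's error is 18% lower than the baseline's on the same test set, not that the error dropped by 18 BMI units.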
Sunusi Bala Abdullahi (Fri,) studied this question.