What question did this study set out to answer?

The study aims to improve the prediction of height, weight, and BMI from facial images by fusing facial and demographic information in a cross-modal Vision Transformer.

April 19, 2026 · Open Access

Multi-task cross-modal attention networks for robust anthropometric prediction

Key points

Developed ViT-hCMA, an anthropometric framework integrating facial and demographic data.
Implemented a gradient-harmonized regression loss for diverse anthropometric targets.
Used cross-modal attention for consistent fusion of visual attributes and demographic cues.
Employed CycleGAN-based augmentation to address variability in pose and lighting.
Reduced MAE relative to prior methods by 12% for height, 15% for weight, and 18% for BMI.
Showed consistent performance improvements across demographic groups.
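
To make the reported percentages concrete, the sketch below shows one way a relative MAE reduction can be computed. It assumes the figures are measured relative to a baseline method's MAE, which the summary implies but does not spell out. The function names and all sample values are hypothetical and are not data from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_mae_reduction(baseline_mae, model_mae):
    """Percentage by which model_mae is lower than baseline_mae."""
    if baseline_mae <= 0:
        raise ValueError("baseline_mae must be positive")
    return 100.0 * (baseline_mae - model_mae) / baseline_mae


# Hypothetical heights in cm, for illustration only (not from the study).
heights_true = [170.0, 160.0, 180.0, 175.0]
baseline_pred = [175.0, 155.0, 185.0, 170.0]  # every prediction off by 5 cm
model_pred = [174.0, 156.0, 184.0, 171.0]     # every prediction off by 4 cm

baseline_mae = mean_absolute_error(heights_true, baseline_pred)
model_mae = mean_absolute_error(heights_true, model_pred)
reduction = relative_mae_reduction(baseline_mae, model_mae)
print(f"baseline MAE={baseline_mae:.1f} cm, model MAE={model_mae:.1f} cm, "
      f"reduction={reduction:.1f}%")
```

Under this reading, a 12% reduction for height means the model's MAE is 0.88 times the baseline's MAE, whatever the absolute error in centimetres.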

Abstract

Accurate estimation of anthropometric attributes (height, weight, and body mass index, BMI) from facial imagery supports emerging applications in telemedicine, soft biometrics, and large-scale population screening. Existing approaches struggle with limited multimodal fusion capabilities, demographic sensitivity, and reduced reliability in unconstrained visual conditions. We present the anthropometric ViT-hCMA, a cross-modal Vision Transformer that integrates facial embeddings with demographic cues through a human-centric attention mechanism. The framework introduces three components: (i) a gradient-harmonized multi-task regression loss that balances heterogeneous anthropometric targets, (ii) cross-modal attention that enables physiologically consistent fusion of visual and auxiliary attributes, and (iii) CycleGAN-based augmentation that improves robustness to pose and illumination variability. Evaluation on the VIP Attributes Dataset demonstrates consistent gains over prior methods, yielding MAE reductions of 12% for height, 15% for weight, and 18% for BMI. Interpretability analysis via transformer-based Grad-CAM confirms that the model identifies biomechanically meaningful regions (jawline and brow for height, cheek-chin morphology for BMI), with stable behavior across demographic subgroups. While performance remains weaker in higher BMI ranges, ViT-hCMA offers a scalable and transparent solution for anthropometric prediction when full-body imagery is unavailable.
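
As a rough illustration of the cross-modal attention idea in component (ii), the following is a minimal sketch of scaled dot-product attention in which visual tokens act as queries and demographic tokens act as keys and values. It is not the authors' implementation: the query/key assignment, the absence of learned projections, the residual addition, and every name and value here are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(visual_tokens, demographic_tokens):
    """Each visual token (query) attends over the demographic tokens
    (keys and values). Learned projection matrices are omitted, so the
    raw embeddings are used directly (an assumption of this sketch)."""
    dim = len(visual_tokens[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in visual_tokens:
        row = softmax([dot(query, key) / scale for key in demographic_tokens])
        context = [
            sum(w * value[i] for w, value in zip(row, demographic_tokens))
            for i in range(dim)
        ]
        # Residual connection: keep the visual signal, add demographic context.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(row)
    return fused, weights


# Hypothetical 2-dimensional embeddings, for illustration only.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
demographic_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fused, weights = cross_modal_attention(visual_tokens, demographic_tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1 and shows how strongly one visual token draws on each demographic token. A trained model would learn separate query, key, and value projections and typically use several attention heads.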
