Technologies for recognizing facial attributes such as race, gender, age, and emotion from images of human faces have many applications, including personalized advertising, sentiment analysis, and the study of demographic trends and social behaviors. Analyzing face images and facial expressions is challenging because of the complexity of human facial attributes and the diversity in their representation. Although numerous attempts have been made to improve facial attribute classification, there remains a strong demand for higher accuracy. In this paper, we propose "FaceScanPaliGemma," a multi-agent vision language model (VLM) system consisting of four fine-tuned Google PaliGemma models, each specialized in classifying one facial attribute. To evaluate the proposed solution, we used the public "FairFace" and "AffectNet" datasets. The system reaches accuracies of up to 81.1%, 95.8%, 80.0%, and 59.4% for race, gender, age group, and emotion classification, respectively, outperforming other VLMs such as OpenAI GPT, Google Gemini, LLaVA, and Google PaliGemma under zero-shot evaluation.
AlDahoul et al. studied this question.
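The multi-agent design described in the abstract, in which one specialized model handles each facial attribute, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `FaceAttributeSystem`, `make_stub`, and `analyze` are hypothetical, and the agents are placeholder functions standing in for the four fine-tuned PaliGemma models, whose loading and prompting details are not given here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# An "agent" is any callable mapping raw image bytes to a label string.
# Assumption: in the described system each agent would wrap a fine-tuned
# PaliGemma model; here the agents are stubs, not real model calls.
Agent = Callable[[bytes], str]


@dataclass
class FaceAttributeSystem:
    """Hypothetical container that routes one image to every attribute agent."""

    agents: Dict[str, Agent]

    def analyze(self, image: bytes) -> Dict[str, str]:
        # Each agent sees the same image and answers only for its own attribute.
        return {name: agent(image) for name, agent in self.agents.items()}


def make_stub(label: str) -> Agent:
    """Build a placeholder agent that always returns the given label."""

    def agent(image: bytes) -> str:
        return label

    return agent


# One agent per attribute named in the abstract.
system = FaceAttributeSystem(
    agents={
        "race": make_stub("<race label>"),
        "gender": make_stub("<gender label>"),
        "age_group": make_stub("<age group label>"),
        "emotion": make_stub("<emotion label>"),
    }
)

result = system.analyze(b"placeholder image bytes")
print(result)
```

Keeping the agents behind a common callable interface means each attribute's model can be fine-tuned, swapped, or evaluated on its own without touching the others, which is one plausible motivation for a per-attribute design.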