Abstract In gaze behavior understanding, existing vision-based models are inherently limited in high-dimensional semantic understanding, while vision-language models (VLMs) struggle with precise object localization. To address these issues, we propose GazeLLM, the first zero-shot framework that uses a large language model (LLM) to boost gaze target reasoning. Our contributions are threefold. First, structured object extraction: using off-the-shelf detection and depth-estimation models (e.g., MM-GroundingDINO and Depth Anything V2), we convert images into 3D object representations that include head and gaze direction, object categories, and metric depth. Second, autonomous chain-of-thought (CoT) reasoning: we design self-generated CoT prompts that guide pretrained LLMs, such as ChatGPT o3-mini-high, to predict gaze targets via spatial-semantic analysis. Third, a plug-and-play module: a novel cross-modal fusion mechanism combines the LLM’s probability dictionaries with vision-based gaze heatmaps via Gaussian-weighted multi-hot mapping. Extensive experiments show that GazeLLM significantly improves state-of-the-art models, increasing their performance from 17% to 34% on challenging cases, such as long-range targets or rare categories, without retraining. It also extends seamlessly to multi-person social gaze tasks (e.g., a 42% LAEO AP gain on the AVA-LAEO benchmark). Our framework demonstrates superior generalizability and interpretability compared to VLMs, validating the efficacy of LLMs in understanding gaze behavior by mining semantic cues.
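To make the fusion step concrete, the following is a minimal sketch of how an LLM probability dictionary might be rendered as a Gaussian-weighted multi-hot map and combined with a vision-based heatmap. The abstract does not specify the actual formulation, so everything here is an assumption for illustration: the function names (`gaussian_multi_hot`, `fuse`, `argmax_2d`), the box-size-based Gaussian width, the per-pixel maximum over objects, and the convex combination with weight `alpha` are all hypothetical choices, not the authors' method.

```python
import math

def gaussian_multi_hot(probs, boxes, height, width, sigma_scale=0.5):
    """Render one Gaussian per object, scaled by its LLM probability.

    probs: {object_name: probability} (hypothetical LLM output format)
    boxes: {object_name: (x0, y0, x1, y1)} in pixel coordinates
    Overlapping Gaussians are merged with a per-pixel maximum (an assumption).
    """
    grid = [[0.0] * width for _ in range(height)]
    for name, p in probs.items():
        x0, y0, x1, y1 = boxes[name]
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        # Gaussian width follows the box size, with a floor of one pixel.
        sx = max((x1 - x0) * sigma_scale, 1.0)
        sy = max((y1 - y0) * sigma_scale, 1.0)
        for y in range(height):
            for x in range(width):
                g = math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
                grid[y][x] = max(grid[y][x], p * g)
    return grid

def fuse(vision, semantic, alpha=0.5):
    """Convex combination of the two maps (assumed fusion rule)."""
    return [
        [(1.0 - alpha) * v + alpha * s for v, s in zip(v_row, s_row)]
        for v_row, s_row in zip(vision, semantic)
    ]

def argmax_2d(heatmap):
    """Return the (x, y) location of the largest heatmap value."""
    _, y, x = max((v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y

# Toy example: the vision heatmap slightly prefers the wrong object.
H = W = 8
vision = [[0.0] * W for _ in range(H)]
vision[1][1] = 0.6  # peak on the "book"
vision[6][6] = 0.5  # weaker peak on the "cup"

llm_probs = {"cup": 0.9, "book": 0.1}
boxes = {"cup": (5, 5, 7, 7), "book": (0, 0, 2, 2)}

semantic = gaussian_multi_hot(llm_probs, boxes, H, W)
fused = fuse(vision, semantic, alpha=0.5)

vision_peak = argmax_2d(vision)
fused_peak = argmax_2d(fused)
```

In this toy setup the vision-only peak sits on the book, while the fused map moves the peak to the cup that the LLM considers far more likely. Because the fusion only post-processes an existing heatmap, a scheme of this kind would not require retraining the vision model, which is consistent with the plug-and-play claim.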
Feng Lu (Mon,) studied this question.