December 1, 2025 · Open Access

GazeLLM: a plug-and-play zero-shot LLM reasoning framework for boosting gaze target detection

Key Points

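GazeLLM raises the performance of state-of-the-art gaze target detection models from 17% to 34% on challenging cases, such as long-range targets or rare categories, without retraining.
The framework uses zero-shot reasoning with a pretrained large language model (LLM) to add semantic understanding to vision-based gaze estimation.
It combines structured object extraction by off-the-shelf vision models with self-generated chain-of-thought (CoT) prompting, then fuses the LLM output with vision-based gaze heatmaps.
GazeLLM extends to multi-person social gaze tasks (a 42% LAEO AP gain on the AVA-LAEO benchmark) and offers better generalizability and interpretability than vision-language models.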

Abstract

In the gaze behavior understanding task, existing vision-based models demonstrate inherent limitations in high-dimensional semantic understanding, while vision-language models (VLMs) encounter challenges in precise object localization. To address this issue, we propose GazeLLM, the first zero-shot large language model (LLM) boosted framework for gaze target reasoning. Our key innovations cover three aspects. First, structured object extraction: using off-the-shelf models (e.g., MM-GroundingDINO and Depth Anything V2), we convert images into 3D object representations that include head and gaze direction, object categories, and metric depth. Second, autonomous chain-of-thought (CoT) reasoning: we design self-generated CoT prompts that guide pretrained LLMs, such as ChatGPT o3-mini-high, to predict gaze targets via spatial-semantic analysis. Third, a plug-and-play module: a novel cross-modal fusion mechanism combines the LLM’s probability dictionaries with vision-based gaze heatmaps via Gaussian-weighted multi-hot mapping. Extensive experiments show that GazeLLM significantly improves state-of-the-art models, increasing their performance from 17% to 34% on challenging cases, such as long-range targets or rare categories, without the need for retraining. It also extends seamlessly to multi-person social gaze tasks (e.g., a 42% LAEO AP gain on the AVA-LAEO benchmark). Our framework demonstrates superior generalizability and interpretability compared to VLMs, validating the efficacy of LLMs in understanding gaze behavior by mining semantic cues.
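
The abstract names the fusion step but gives no formula, so the following is only a minimal sketch of one way a per-object probability dictionary could be rendered as a Gaussian-weighted map and blended with a vision heatmap. The function names, the normalized box format, the `sigma_scale` and `alpha` parameters, and the linear blend are all assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def llm_probs_to_heatmap(boxes, probs, size=64, sigma_scale=0.5):
    """Render per-object probabilities as a sum of Gaussians.

    boxes: dict name -> (x1, y1, x2, y2) in normalized [0, 1] coordinates
           (hypothetical format, e.g. produced by an object detector).
    probs: dict name -> probability the LLM assigns to that object.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    xs = (xs + 0.5) / size  # pixel centers in normalized coordinates
    ys = (ys + 0.5) / size
    heat = np.zeros((size, size), dtype=np.float64)
    for name, p in probs.items():
        if name not in boxes:
            continue  # the LLM named an object with no detected box
        x1, y1, x2, y2 = boxes[name]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Assumption: Gaussian spread scales with the box size.
        sx = max((x2 - x1) * sigma_scale, 1e-3)
        sy = max((y2 - y1) * sigma_scale, 1e-3)
        heat += p * np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
    return heat

def fuse(vision_heatmap, llm_heatmap, alpha=0.5):
    """Blend two heatmaps after max-normalization (assumed linear blend)."""
    def norm(h):
        m = h.max()
        return h / m if m > 0 else h
    return (1 - alpha) * norm(vision_heatmap) + alpha * norm(llm_heatmap)

# Illustrative inputs only; these are not data from the paper.
boxes = {"cup": (0.1, 0.1, 0.3, 0.3), "laptop": (0.6, 0.6, 0.9, 0.9)}
probs = {"cup": 0.2, "laptop": 0.8}
vision = np.random.default_rng(0).random((64, 64))
fused = fuse(vision, llm_probs_to_heatmap(boxes, probs), alpha=0.5)
row, col = np.unravel_index(np.argmax(fused), fused.shape)
```

Because the blend only consumes a heatmap produced by an existing gaze model, a step of this shape would not require retraining that model, which is consistent with the plug-and-play framing in the abstract.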
