What question did this study set out to answer?

This research aims to improve grounded question answering in egocentric videos by addressing hallucinations and enhancing reasoning capabilities.

March 7, 2026

Collaborated with Hallucination: Enhancing Egocentric Grounded Question Answering via Error Demonstrations

Key Points

Proposed a Collaborated with Hallucination (CoHa) framework for grounded question answering in egocentric videos (Ego-GQA).
Used Subjective Logic to quantify uncertainty in answers.
Generated diffusion-based noisy visual inputs to create error demonstrations.
Implemented constraints based on quantified uncertainties to guide predictions in reasoning.
Introduced an interactive refinement module to explore finer cues in first-person views.
CoHa outperformed recent state-of-the-art methods on two widely used benchmarks.
The approach effectively reduced uncertainties in predictions related to egocentric contexts.
Error demonstrations constrained the model's reliance on biased pretraining priors.
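
To make the uncertainty step concrete, here is a minimal sketch of how Subjective Logic is commonly used to turn per-answer evidence into belief masses and a single uncertainty score. It follows the standard Dirichlet-based formulation (belief b_k = e_k / S, uncertainty u = K / S, with S = sum of e_k + 1); the function name and the example evidence values are illustrative assumptions, not taken from the CoHa code.

```python
from typing import List, Tuple


def subjective_opinion(evidence: List[float]) -> Tuple[List[float], float]:
    """Map non-negative evidence for K candidate answers to (beliefs, uncertainty).

    Standard Subjective Logic with a uniform prior: alpha_k = e_k + 1,
    S = sum(alpha_k), b_k = e_k / S, u = K / S, so sum(b_k) + u == 1.
    """
    if not evidence:
        raise ValueError("evidence must contain at least one value")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence values must be non-negative")

    num_classes = len(evidence)
    strength = sum(evidence) + num_classes  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


# Hypothetical evidence vectors for three candidate answers.
confident_beliefs, confident_u = subjective_opinion([40.0, 1.0, 1.0])
unsure_beliefs, unsure_u = subjective_opinion([0.5, 0.4, 0.6])

print(f"strong evidence -> u = {confident_u:.3f}")
print(f"weak evidence   -> u = {unsure_u:.3f}")
```

With little total evidence the uncertainty approaches 1, which is the kind of signal that could be used to decide how strongly to constrain an unreliable answer.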

Abstract

Grounded question answering in egocentric videos (Ego-GQA) aims to identify the relevant temporal window and generate a corresponding natural-language response given a textual question. Compared with third-person videos, egocentric video understanding requires more advanced human-centric thinking capability. However, existing Ego-GQA approaches often fail to recognize the inherent limitations of dynamic egocentric context understanding, treating first-person and third-person perspectives equally. This oversight leads to hallucinations and a lack of proper egocentric reasoning in first-person video understanding. To address this issue, we propose a novel Collaborated with Hallucination (CoHa) framework for Ego-GQA, which quantifies the hallucinations generated by an Ego-GQA model and further leverages them as error demonstrations to constrain the model's reasoning process, encouraging it to ground predictions in egocentric visual cues instead of relying on biased pretraining priors. Specifically, we first employ Subjective Logic to quantify the degree of uncertainty in unreliable answers. We then generate diffusion-based noisy visual inputs to amplify the hallucinations as error demonstrations, which are used to impose appropriate constraints on the model according to the uncertainty. These constraints effectively steer predictions away from the unreliable semantics induced by inherent drawbacks in egocentric thinking. Additionally, we incorporate an interactive refinement module that helps the model explore more fine-grained cues observed from the first-person view. Extensive experiments on two widely used benchmarks demonstrate that our CoHa method outperforms recent state-of-the-art methods. Our code is available at https://github.com/Mrshenshen/CoHa.
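
The abstract does not say how the diffusion-based noisy inputs are produced. As one plausible reading, the sketch below applies the standard DDPM forward process, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, to a frame represented as a flat list of pixel values. The linear beta schedule, the step count, and all function names are assumptions made for illustration and are not taken from the CoHa repository.

```python
import math
import random
from typing import List


def linear_alpha_bar(t: int, num_steps: int = 1000,
                     beta_start: float = 1e-4, beta_end: float = 0.02) -> float:
    """Cumulative product of (1 - beta_s) over the first t steps of a linear schedule."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    if not 0 <= t <= num_steps:
        raise ValueError("t must lie in [0, num_steps]")
    alpha_bar = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        alpha_bar *= 1.0 - beta
    return alpha_bar


def add_diffusion_noise(pixels: List[float], t: int, rng: random.Random,
                        num_steps: int = 1000) -> List[float]:
    """Return x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    alpha_bar = linear_alpha_bar(t, num_steps)
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in pixels]


# Hypothetical frame: eight normalized pixel values in [-1, 1].
frame = [-1.0, -0.5, -0.25, 0.0, 0.0, 0.25, 0.5, 1.0]
rng = random.Random(0)  # fixed seed so the sketch is reproducible

lightly_noised = add_diffusion_noise(frame, t=100, rng=rng)
heavily_noised = add_diffusion_noise(frame, t=900, rng=rng)
```

Larger values of `t` leave less of the original frame intact, so a model queried on `heavily_noised` has to lean more on its priors, which is the behavior an error demonstration is meant to expose.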
