Grounded question answering in egocentric videos (Ego-GQA) aims to localize the temporal window relevant to a textual question and to generate the corresponding answer in natural language. Compared with third-person video understanding, egocentric video understanding demands stronger human-centric reasoning. However, existing Ego-GQA approaches often overlook the inherent difficulty of understanding dynamic egocentric context and treat first-person and third-person perspectives alike. This oversight leads to hallucinations and a lack of proper egocentric reasoning in first-person video understanding. To address this issue, we propose a novel Collaborated with Hallucination (CoHa) framework for Ego-GQA, which quantifies the hallucinations generated by an Ego-GQA model and leverages them as error demonstrations to constrain the model's reasoning process, encouraging it to ground predictions in egocentric visual cues instead of relying on biased pretraining priors. Specifically, we first employ Subjective Logic to quantify the degree of uncertainty in unreliable answers. We then generate diffusion-based noisy visual inputs that amplify the hallucinations into error demonstrations, and use these demonstrations to impose appropriate constraints on the model according to the estimated uncertainty. These constraints steer predictions away from the unreliable semantics induced by the inherent drawbacks of egocentric thinking. Additionally, we incorporate an interactive refinement module that helps the model explore more fine-grained cues observed from the first-person view. Extensive experiments on two widely used benchmarks demonstrate that CoHa outperforms recent state-of-the-art methods. Our code is available at https://github.com/Mrshenshen/CoHa.
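To make the uncertainty step concrete, the following is a minimal sketch of the standard Subjective Logic mapping from per-class evidence to belief masses and an uncertainty mass. It is an illustration under stated assumptions, not the CoHa implementation: the function names `subjective_logic_opinion` and `constraint_weight`, the form of the evidence vector, and the linear scaling of the constraint by uncertainty are all hypothetical.

```python
def subjective_logic_opinion(evidence):
    """Map non-negative per-class evidence to (beliefs, uncertainty).

    Standard Subjective Logic with a uniform prior: for K classes and
    evidence e_k >= 0, the Dirichlet strength is S = sum(e_k + 1),
    the belief masses are b_k = e_k / S, and the uncertainty mass is
    u = K / S, so that sum(b_k) + u == 1.
    """
    if not evidence or any(e < 0 for e in evidence):
        raise ValueError("evidence must be a non-empty list of non-negative values")
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    beliefs = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return beliefs, uncertainty


def constraint_weight(uncertainty, max_weight=1.0):
    """Hypothetical rule (an assumption, not from the paper): scale the
    strength of the error-demonstration constraint linearly with uncertainty."""
    return max_weight * uncertainty


# Strong evidence for one candidate answer gives low uncertainty.
beliefs, u = subjective_logic_opinion([8.0, 1.0, 1.0])
print(beliefs, u, constraint_weight(u))
```

With no evidence at all, the uncertainty mass is 1 and every belief mass is 0, which is the case where a constraint derived from error demonstrations would matter most under this sketch.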
Li et al. studied this question.