Video camouflaged object detection (VCOD) aims to identify objects that blend seamlessly into their surroundings in video sequences. Traditional methods rely solely on visual cues to capture the inter-frame motion that reveals camouflaged objects. However, the high similarity between camouflaged objects and their environments often makes visual cues alone unreliable. In addition, random motion such as camera shake and abrupt scene transitions introduces noise into the identification process. To overcome these challenges, we propose the Motion Reasoning Chain Network (MRCNet), a novel cross-modal VCOD framework that emulates the human thought process when observing camouflaged objects, i.e., motion reasoning. Specifically, we introduce a generative sampling strategy grounded in multimodal large language models (MLLMs) that bridges the implicit knowledge space of MLLMs and the explicit representation space of camouflaged-object attributes, enabling a motion reasoning chain tailored to VCOD to be established. Through motion and concept attribute reasoning, this chain provides semantic guidance for the visual comprehension of camouflaged objects. To improve the identification of camouflaged objects, we develop motion representation learning driven by the motion reasoning chain. It introduces hierarchical de-biased motion prototype learning to mitigate MLLM hallucinations and thereby strengthen motion perception. To learn precise prompts for the visual foundation model, cross-modal prompt learning further incorporates the de-biased concept prototype into visual representations, enhancing the visual comprehension of camouflaged objects. Extensive experiments on three datasets demonstrate that MRCNet achieves state-of-the-art results on both general metrics and spatiotemporal consistency metrics.
Hui et al. (Fri,) studied this question.