What question did this study set out to answer?

To develop a systematic model that simulates human-like cognitive mechanisms for visual question answering.

March 15, 2026

SHC: Deeply Activating Human-like Cognitive Ability for Visual Question Answering

Key Points

Aimed to develop a systematic model that simulates human-like cognitive mechanisms for visual question answering.
Proposed a Systematic Human-like Cognitive (SHC) method with three core modules: Hierarchical Perception, Semantic Refinement, and Dynamic Reasoning.
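Implemented a Hierarchical Perception Module (HPM) that extracts hierarchical features from different levels, simulating the incremental integration mode of biological neural systems.
Designed a Semantic Refinement Module (SRM) that enhances high-level semantics from low-level features through a multi-level cascaded attention structure.
Used a Dynamic Reasoning Module (DRM) to fuse high-level semantic features and low-level fine-grained features into a unified visual representation for reasoning.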
SHC demonstrated competitive performance on visual question answering benchmarks including VQA-v2 and GQA.
Compared favorably with same-scale multimodal models and, on certain tasks, surpassed larger-scale ones.
Validated SHC's potential to improve multi-modal understanding tasks and fine-grained visual perception.

Abstract

Human cognition depends on a sophisticated information-processing framework comprising perception, attention, memory, language, reasoning, problem solving, and decision-making. However, current research focuses on isolated processes rather than systematically simulating the human cognitive mechanism. Meanwhile, with the rapid development of large language models, related work has centered predominantly on language-level exploration, while in-depth mining of visual information remains insufficient. Here, to deeply activate multi-modal understanding ability, a Systematic Human-like Cognitive (SHC) method is proposed for visual question answering, in which the seven processes above are systematically modeled as three core modules: hierarchical perception, semantic refinement, and dynamic reasoning. The Hierarchical Perception Module (HPM) extracts hierarchical features from different levels to simulate the incremental integration mode of biological neural systems. Based on selective attention theory, the Semantic Refinement Module (SRM) is designed as a key-value accumulation optimization mechanism that enhances high-level semantics from low-level features via a multi-level cascaded attention structure. Finally, the Dynamic Reasoning Module (DRM), following utility-maximization decision theory, employs a dual weighting mechanism to dynamically fuse high-level semantic features and low-level fine-grained features, forming a unified high-quality visual representation that is then fed into the large language model, together with the text input, for reasoning. Experimental results demonstrate that SHC achieves competitive performance on multiple visual question answering benchmarks, including VQA-v2, Text-VQA, GQA, and ScienceQA, as well as on multimodal evaluation benchmarks such as POPE, MMB, MME, and MM-Vet. Comparative experiments with multiple same-scale models validate the potential of SHC to improve performance on multi-modal understanding tasks and its superiority in fine-grained visual information perception; on certain tasks, SHC even surpasses larger-scale multimodal models. Our code will be released on GitHub: https://github.com/fjwang3/SHC.
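
The abstract names the mechanisms but gives no equations, so the following is only a minimal sketch of how cascaded attention with key-value accumulation and a two-weight fusion could fit together. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`attend`, `semantic_refinement`, `dynamic_fusion`), the use of scaled dot-product attention, the residual update, and the softmax over two scalar scores standing in for the "dual weighting mechanism".

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def semantic_refinement(levels):
    """Hypothetical SRM stand-in.

    `levels` holds one feature map per level, lowest first; the last
    entry plays the role of the high-level semantics. At each cascade
    step the tokens of one more low level are appended to a shared
    key/value memory (the assumed "accumulation"), and the high-level
    tokens attend to that memory with a residual update.
    """
    queries = levels[-1]
    memory = []
    for level in levels[:-1]:
        memory = memory + level  # accumulate keys/values across levels
        refined = attend(queries, memory, memory)
        queries = [[q_d + r_d for q_d, r_d in zip(q, r)]
                   for q, r in zip(queries, refined)]
    return queries


def dynamic_fusion(high, low, score_high, score_low):
    """Hypothetical DRM stand-in: two softmax-normalised weights blend
    the high-level and low-level streams token by token."""
    w_high, w_low = softmax([score_high, score_low])
    return [[w_high * h + w_low * l for h, l in zip(hv, lv)]
            for hv, lv in zip(high, low)]


# Toy input: three levels, two tokens each, two channels per token.
levels = [
    [[1.0, 0.0], [0.0, 1.0]],   # lowest level (fine-grained)
    [[0.5, 0.5], [1.0, 1.0]],   # intermediate level
    [[0.2, 0.1], [0.3, 0.4]],   # highest level (semantic)
]
high = semantic_refinement(levels)
fused = dynamic_fusion(high, levels[0], score_high=0.0, score_low=0.0)
```

In this sketch the two fusion scores are passed in by hand; in a real model they would presumably be predicted from the features or the question, which the abstract does not specify. The resulting `fused` tokens stand in for the unified visual representation that the abstract says is passed to the language model together with the text input.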
