Human cognition depends on a sophisticated information-processing framework that includes perception, attention, memory, language, reasoning, problem solving, and decision-making. However, current research focuses only on isolated processes rather than systematically simulating the human cognitive mechanism as a whole. Meanwhile, with the rapid development of large language models, related work has centered predominantly on language-level exploration, while in-depth mining of visual information remains insufficient. Here, to more fully activate multimodal understanding ability, a Systematic Human-like Cognitive (SHC) method is proposed for visual question answering, in which the seven processes above are systematically modeled as three core modules: hierarchical perception, semantic refinement, and dynamic reasoning. The Hierarchical Perception Module (HPM) extracts features at different levels to simulate the incremental integration mode of biological neural systems. Based on selective attention theory, the Semantic Refinement Module (SRM) is designed as a key-value accumulation optimization mechanism that enhances high-level semantics from low-level features via a multi-level cascaded attention structure. Finally, the Dynamic Reasoning Module (DRM), following utility-maximization decision theory, employs a dual weighting mechanism to dynamically fuse high-level semantic features and low-level fine-grained features, forming a unified high-quality visual representation that is fed into the large language model together with the text input for reasoning. Experimental results demonstrate that SHC achieves competitive performance on multiple visual question answering benchmarks, including VQA-v2, Text-VQA, GQA, and ScienceQA, as well as on multimodal evaluation benchmarks such as POPE, MMB, MME, and MM-Vet.
Comparative experiments against multiple models of the same scale validate the potential of SHC to improve performance on multimodal understanding tasks and its superiority in fine-grained visual information perception; on certain tasks, SHC even surpasses larger-scale multimodal models. Our code will be released on GitHub: https://github.com/fjwang3/SHC.
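To make the data flow through the three modules concrete, the following is a minimal illustrative sketch in plain Python. It is not the released SHC implementation. The abstract does not specify layer choices, projections, or how the dual weights are computed, so everything here is an assumption: the function names (`attend`, `semantic_refinement`, `dynamic_fusion`), the use of unprojected scaled dot-product attention, the order in which levels are accumulated, and the squared-norm score that stands in for a learned utility.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no learned projections)."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        w = softmax([dot(q, k) / scale for k in keys])
        out.append([sum(wi * v[d] for wi, v in zip(w, values)) for d in range(dim)])
    return out


def semantic_refinement(levels):
    """Hypothetical SRM: refine top-level tokens with a cascaded attention pass per lower level.

    Assumption: keys and values accumulate across levels, so each pass
    attends over all lower-level tokens seen so far.
    """
    refined = levels[-1]
    pool = []
    for level in reversed(levels[:-1]):
        pool = pool + level  # key-value accumulation
        context = attend(refined, pool, pool)
        # Residual update of the high-level tokens.
        refined = [[r + c for r, c in zip(rv, cv)] for rv, cv in zip(refined, context)]
    return refined


def mean_vec(tokens):
    n = len(tokens)
    return [sum(t[d] for t in tokens) / n for d in range(len(tokens[0]))]


def dynamic_fusion(semantic, fine):
    """Hypothetical DRM: fuse two branches with a pair of normalized weights.

    Assumption: the squared norm of each pooled branch is a stand-in for a
    learned utility score; a real model would learn this scoring.
    """
    s, f = mean_vec(semantic), mean_vec(fine)
    w_sem, w_fine = softmax([dot(s, s), dot(f, f)])
    fused = [[w_sem * a + w_fine * b for a, b in zip(sv, fv)]
             for sv, fv in zip(semantic, fine)]
    return fused, (w_sem, w_fine)


# Stand-in for HPM output: three feature levels (low to high), two tokens each.
levels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.2, 0.8]],
    [[0.9, 0.1], [0.1, 0.9]],
]
refined = semantic_refinement(levels)
fused, weights = dynamic_fusion(refined, levels[0])
```

In this sketch `fused` plays the role of the unified visual representation that would be passed to the language model alongside the text tokens. The two weights sum to one by construction, which is one simple way to realize a dual weighting; the actual mechanism may differ.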
Wang et al. (Thu,) studied this question.