Abstract Through training on large-scale image-text pairs, vision-language models (VLMs) learn to align visual information with natural-language semantics, which improves performance on downstream tasks with unseen data. No comparably large neuromorphic vision dataset with paired natural language exists, so training an event-based model at that scale to achieve generalized understanding is not feasible. This work therefore introduces a neuromorphic adapter neural network that combines CLIP-based semantic understanding with neuromorphic-inspired feature extraction to improve the robustness and efficiency of object classification in real-world scenarios. Specifically, by combining the temporal information of neuromorphic vision with the multimodal strengths of CLIP, our approach performs well in few-shot learning and extends comprehension across object classification tasks. We evaluate our approach on three public datasets, N-Cars, N-Caltech, and N-ImageNet, and obtain encouraging few-shot classification accuracy compared with state-of-the-art models. Moreover, compared with zero-shot inference, our approach improves classification accuracy by +9.92%, +33.65%, and +42.63% under the 1-shot, 15-shot, and 20-shot settings, respectively. These results demonstrate the effectiveness of adapting pre-trained vision-language models to event data, enabling learning and inference even with limited annotated data.
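The abstract does not specify the adapter architecture, so the following is only a minimal sketch of the general idea under stated assumptions: events are accumulated into a two-channel polarity count image, a frozen encoder (not shown here) would turn that image into a feature vector, and a small residual bottleneck adapter in the style of CLIP-Adapter refines the feature before it is compared with class text embeddings by cosine similarity. The function names, the blending weight `alpha`, the temperature, and the random stand-in embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def events_to_frame(events, height, width):
    """Accumulate (x, y, t, polarity) events into a 2-channel count image."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    p = (events[:, 3] > 0).astype(int)  # channel 0: negative, channel 1: positive
    np.add.at(frame, (p, y, x), 1.0)    # unbuffered add, so repeated pixels count
    return frame


def l2_normalize(v, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)


def adapt_features(feat, w_down, w_up, alpha=0.2):
    """Residual bottleneck adapter: blend adapted and original features."""
    hidden = np.maximum(feat @ w_down, 0.0)    # ReLU after down-projection
    adapted = np.maximum(hidden @ w_up, 0.0)   # ReLU after up-projection
    return alpha * adapted + (1.0 - alpha) * feat


def classify(image_feat, text_feats, temperature=100.0):
    """Softmax over scaled cosine similarities to each class text embedding."""
    logits = temperature * l2_normalize(image_feat) @ l2_normalize(text_feats).T
    logits = logits - logits.max()             # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy demo with random stand-ins for the frozen encoder outputs (assumption).
rng = np.random.default_rng(0)
events = np.array([[1, 2, 0.01, 1], [1, 2, 0.02, 1], [3, 0, 0.03, -1]])
frame = events_to_frame(events, height=4, width=5)

dim, bottleneck, num_classes = 16, 4, 3
image_feat = rng.normal(size=dim)                 # would come from the image encoder
text_feats = rng.normal(size=(num_classes, dim))  # would come from class prompts
w_down = rng.normal(size=(dim, bottleneck)) * 0.1
w_up = rng.normal(size=(bottleneck, dim)) * 0.1

probs = classify(adapt_features(image_feat, w_down, w_up), text_feats)
```

In a few-shot setting of this kind, only the small adapter weights (`w_down`, `w_up` here) would be trained on the labeled examples while the pre-trained encoders stay frozen, which is one common way to adapt a large model with limited annotated data.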
Xiaoqian Huang
Hussain Sajwani
Oussama Abdul Hay
Khalifa University of Science and Technology
Huang et al. studied this question.
www.synapsesocial.com/papers/68e6849eb6db64358760d857 — DOI: https://doi.org/10.21203/rs.3.rs-4415554/v1