What question did this study set out to answer?

This research aims to enhance the retrieval and matching of digital English teaching resources by integrating image and text analysis.

June 15, 2026 · Open Access

A multi modal fusion framework combining CNN-based image recognition and BERT-based NLP for intelligent retrieval and matching of English teaching resources

Key Points

Developed a multimodal retrieval model combining CNN image recognition and BERT NLP.
Utilized a five-layer CNN to extract visual features from images and a BERT model for text vectors, fusing these into a unified representation.
Evaluated the system's performance on 500 query tasks measuring precision, recall, and F1-score.
Achieved precision of 0.91, recall of 0.88, and F1-score of 0.895 on the query tasks.
Compared to a baseline TF-IDF model (precision 0.72), the new model improved precision by 26.4% in relative terms.
Compared to a BERT-only model (precision 0.87), precision improved by 4.6% in relative terms.
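
The reported figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that the F1-score is the usual harmonic mean of precision and recall, and that the percentage gains are relative (not percentage-point) improvements over the baseline precisions of 0.72 and 0.87 given in the abstract. The function names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Values reported in the abstract.
precision, recall = 0.91, 0.88
tfidf_precision, bert_precision = 0.72, 0.87

f1 = f1_score(precision, recall)
gain_vs_tfidf = relative_gain(precision, tfidf_precision)
gain_vs_bert = relative_gain(precision, bert_precision)

print(f"F1 = {f1:.3f}")                             # F1 = 0.895
print(f"gain vs TF-IDF = {gain_vs_tfidf:.1f}%")     # gain vs TF-IDF = 26.4%
print(f"gain vs BERT-only = {gain_vs_bert:.1f}%")   # gain vs BERT-only = 4.6%
```

Under these assumptions the three derived numbers agree with the reported 0.895, 26.4%, and 4.6%.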

Abstract

Digital English teaching resources combine images and text, so keyword-based retrieval captures their semantics poorly. This paper proposes a multimodal retrieval model that integrates image recognition and natural language processing to address this problem. The proposed method uses a five-layer CNN to extract 512-dimensional visual features from 224×224 pixel images and a BERT-based semantic model to generate 768-dimensional text vectors. These features are fused into a 1,024-dimensional unified representation via weighted linear combination (visual weight 0.4, text weight 0.6). On 500 query tasks, the proposed system achieves a precision of 0.91, recall of 0.88, and F1-score of 0.895, with an average response time between 0.63 and 0.78 seconds. Compared to a baseline TF-IDF model (precision 0.72) and a BERT-only model (precision 0.87), the proposed multimodal approach improves precision by 26.4% and 4.6%, respectively. These results demonstrate that multimodal feature representation significantly enhances semantic matching for teaching resources.
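
The abstract does not say how the 512- and 768-dimensional vectors are brought into one 1,024-dimensional space before the weighted combination. One plausible reading, shown below as a minimal sketch, is to project each modality to 1,024 dimensions with a linear map and then take the 0.4/0.6 weighted sum. The projection matrices are random placeholders standing in for learned parameters, the feature vectors are random stand-ins for CNN and BERT outputs, cosine similarity is an assumed ranking measure, and all function names are hypothetical. This is not the authors' implementation.

```python
import random

VISUAL_DIM, TEXT_DIM, FUSED_DIM = 512, 768, 1024  # sizes from the abstract
VISUAL_WEIGHT, TEXT_WEIGHT = 0.4, 0.6             # weights from the abstract


def random_projection(in_dim, out_dim, rng):
    """Placeholder for a learned linear layer (assumption): out_dim x in_dim matrix."""
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product; returns one value per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(visual, text, visual_proj, text_proj):
    """Weighted linear combination of the two projected feature vectors."""
    v = project(visual_proj, visual)
    t = project(text_proj, text)
    return [VISUAL_WEIGHT * vi + TEXT_WEIGHT * ti for vi, ti in zip(v, t)]


def cosine_similarity(a, b):
    """Cosine similarity, an assumed choice for comparing fused vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


rng = random.Random(0)
visual_proj = random_projection(VISUAL_DIM, FUSED_DIM, rng)
text_proj = random_projection(TEXT_DIM, FUSED_DIM, rng)

# Random stand-ins for CNN and BERT outputs; a real system computes these from data.
visual_features = [rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
text_features = [rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)]

fused = fuse(visual_features, text_features, visual_proj, text_proj)
print(len(fused))  # 1024
```

In a retrieval setting, a query and each candidate resource would be fused the same way and candidates ranked by `cosine_similarity` against the query vector. Concatenating the two vectors instead would give 1,280 dimensions rather than 1,024, which is why a projection step is assumed here.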

Can Li studied this question.

synapsesocial.com/papers/6a2f97a9a1cfeec490828ba9 https://doi.org/10.1504/ijris.2026.154120
