Digital English teaching resources combine images and text, so keyword-based retrieval captures their semantics poorly. This paper proposes a multimodal retrieval model that integrates image recognition and natural language processing to address this problem. The proposed method uses a five-layer CNN to extract 512-dimensional visual features from 224×224 pixel images and a BERT-based semantic model to generate 768-dimensional text vectors. These features are fused into a 1,024-dimensional unified representation via a weighted linear combination (visual weight 0.4, text weight 0.6). On 500 query tasks, the proposed system achieves a precision of 0.91, a recall of 0.88, and an F1-score of 0.895, with an average response time between 0.63 and 0.78 seconds. Compared with a baseline TF-IDF model (precision 0.72) and a BERT-only model (precision 0.87), the proposed multimodal approach improves precision by 26.4% and 4.6% in relative terms, respectively. These results demonstrate that multimodal feature representation significantly enhances semantic matching for teaching resources.
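The reported F1-score and precision gains follow from the other figures in the abstract by simple arithmetic. The short Python sketch below reproduces them, assuming that F1 is the usual harmonic mean of precision and recall and that the gains are relative to each baseline's precision. The helper names `f1_score` and `relative_gain` are illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100


# Values as reported in the abstract.
precision, recall = 0.91, 0.88
tfidf_precision, bert_precision = 0.72, 0.87

f1 = f1_score(precision, recall)
gain_tfidf = relative_gain(precision, tfidf_precision)
gain_bert = relative_gain(precision, bert_precision)

print(f"F1 = {f1:.3f}")                           # F1 = 0.895
print(f"gain over TF-IDF = {gain_tfidf:.1f}%")    # gain over TF-IDF = 26.4%
print(f"gain over BERT-only = {gain_bert:.1f}%")  # gain over BERT-only = 4.6%
```

Under these assumptions the computed values agree with the reported ones, which confirms that the percentages are relative improvements and not absolute differences in precision (0.19 and 0.04).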
Can Li (Thu,) studied this question.