What question did this study set out to answer?

This research aims to improve accuracy and efficiency in matching product images to text descriptions in e-commerce.

March 24, 2026

E-commerce Product Image-Text Matching Algorithm Based on Lightweight Transformer and Cross-modal Attention Fusion (基于轻量级Transformer+跨模态注意力融合的电商商品图文匹配算法)

Key Points

Developed lightweight ITTransformer and TTransformer encoders to extract image visual features and text semantic features, respectively.
Applied dynamic channel pruning (40% pruning rate) and feature dimension compression (compression r = 4) to reduce the parameter count.
Designed a cross-modal attribute attention module (AGA) that uses text attributes as queries and aligns them with image spatial features.
Used cosine similarity to decide whether an image and a text description match.
Achieved 88.7% Top-1 accuracy in experiments on FashionGen and self-collected e-commerce datasets.
Reduced the model parameter count to 38.6 million.
Reached an inference speed of 185 frames per second.
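
The cosine-similarity matching step and the Top-1 metric listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy embeddings, and the rule that a zero vector scores 0 are all assumptions made for illustration, and the example assumes that the image and text at the same index form the ground-truth pair.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_u * norm_v)

def top1_accuracy(image_embs, text_embs):
    """Fraction of images whose best-scoring text has the same index.

    Assumes image_embs[i] and text_embs[i] are the ground-truth pair.
    """
    hits = 0
    for i, img in enumerate(image_embs):
        scores = [cosine_similarity(img, txt) for txt in text_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(image_embs)

# Toy embeddings, made up for illustration (not taken from the paper).
images = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
texts = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 0.7]]
accuracy = top1_accuracy(images, texts)
```

In practice the embeddings would come from the image and text encoders, and a threshold on the similarity score (not specified in the summary) could turn the score into a match or mismatch decision.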

Abstract

To address low feature-alignment accuracy, the high computational cost of general-purpose models, and the difficulty of verifying attribute inconsistencies in e-commerce product image-text matching, we propose an image-text matching algorithm based on a lightweight Transformer and cross-modal attribute attention fusion. The algorithm first extracts visual features from product images and semantic features from text separately, using the lightweight ITTransformer and TTransformer, and introduces dynamic channel pruning (pruning rate 40%) and feature dimension compression (compression r = 4) to reduce the number of parameters. A cross-modal attribute attention module (AGA) designed for e-commerce then uses text attributes as Queries and deeply aligns them with image spatial features. Finally, cosine similarity is used to decide whether an image and a text description match. Experiments on FashionGen and self-collected e-commerce datasets show that the algorithm reaches a Top-1 accuracy of 88.7% with only 38.6M parameters and an inference speed of 185 FPS. The model is therefore substantially lighter while preserving matching accuracy, which makes it suitable for real-time recommendation and violation verification on e-commerce platforms.
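
The idea of using text attributes as Queries over image spatial features can be sketched as plain scaled dot-product cross-attention. This is a simplified illustration under stated assumptions, not the AGA module itself: there are no learned projections, the image region features serve as both Keys and Values, and the function names and toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attribute_attention(attr_queries, region_feats):
    """Scaled dot-product cross-attention.

    attr_queries: one vector per text attribute (the Queries).
    region_feats: one vector per image region, used here as both Keys
                  and Values (an assumption; no learned projections).
    Returns (attended vectors, attention weights), one row per attribute.
    """
    d = len(region_feats[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for q in attr_queries:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in region_feats]
        w = softmax(logits)
        # Weighted sum of region features for this attribute.
        out = [sum(w[j] * region_feats[j][c] for j in range(len(region_feats)))
               for c in range(d)]
        attended.append(out)
        weights.append(w)
    return attended, weights

# Toy data, made up for illustration: three image regions, two attributes.
regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
attributes = [[3.0, 0.0], [0.0, 3.0]]
attended, weights = attribute_attention(attributes, regions)
```

Each row of `weights` shows which image regions a given attribute attends to. An attribute stated in the text but weakly supported by every region could then signal an attribute inconsistency, although the summary does not say how the authors make that decision.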


Cite This Study

Xu et al. studied this question.

synapsesocial.com/papers/69c229bdaeb5a845df0d4a01 https://doi.org/10.66106/kyyyau.20250301
