Abstract: To address low feature alignment accuracy, the high computational cost of general-purpose models, and the difficulty of verifying attribute inconsistencies in e-commerce commodity image-text matching, we propose an image-text matching algorithm based on lightweight Transformers and cross-modal attribute attention fusion. The algorithm first extracts visual features from commodity images and semantic features from text separately, using the lightweight ITTransformer and TTransformer, and applies dynamic channel pruning (pruning rate 40%) and feature dimension compression (compression ratio r=4) to reduce the number of parameters. A cross-modal attribute attention module (AGA) designed for e-commerce then uses text attributes as queries and aligns them with image spatial features. Finally, cosine similarity is used to make the matching decision. Experiments on FashionGen and self-collected e-commerce datasets show that the algorithm reaches a Top-1 accuracy of 88.7% with only 38.6M parameters and an inference speed of 185 FPS. The model is thus substantially lighter while preserving matching accuracy, making it suitable for real-time recommendation and violation verification on e-commerce platforms.
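To make the attribute attention and cosine matching steps concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections, and it assumes the match score is the mean per-attribute cosine similarity; the function names (`attribute_attention`, `match_score`) and the tensor shapes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attribute_attention(text_attrs, image_feats):
    """Text attributes act as queries over image spatial features.

    text_attrs:  (A, d) attribute embeddings (queries).
    image_feats: (N, d) spatial image features (keys and values).
    Returns the (A, d) attended image features and (A, N) attention weights.
    Assumption: no learned Q/K/V projections, unlike a trained module.
    """
    d = text_attrs.shape[-1]
    scores = text_attrs @ image_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    attended = weights @ image_feats
    return attended, weights

def cosine_similarity(a, b, eps=1e-8):
    # Row-wise cosine similarity between two (A, d) arrays.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return (a * b).sum(axis=-1)

def match_score(text_attrs, image_feats):
    # Assumption: average the per-attribute similarities into one score.
    attended, _ = attribute_attention(text_attrs, image_feats)
    return float(cosine_similarity(text_attrs, attended).mean())

# Random stand-ins for encoder outputs: 3 attributes, a 7x7 feature map, d=64.
rng = np.random.default_rng(0)
text_attrs = rng.normal(size=(3, 64))
image_feats = rng.normal(size=(49, 64))
attended, weights = attribute_attention(text_attrs, image_feats)
score = match_score(text_attrs, image_feats)
```

In this sketch a low per-attribute similarity would flag an attribute that the image does not support, which is one plausible way to use the score for inconsistency checks; the decision threshold is left unspecified because the abstract does not give one.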
Xu et al. (Mon,) studied this question.