E-commerce platforms generate vast amounts of multi-modal data, such as product images and user reviews, and analysing these jointly is crucial for improving user experience and decision making. However, existing methods often treat visual perception and text sentiment analysis separately, which limits cross-modal semantic collaboration. This work therefore proposes a multi-modal hierarchical collaborative fusion model (MHCFM) that unifies product visual attributes, aesthetic quality, scene context, and textual emotion through cross-modal alignment and hierarchical adaptive fusion. The model integrates a hierarchical visual transformer, a dual-branch aesthetic network, a graph convolutional scene module, and a hierarchical adaptive fusion network. In experiments on public and large-scale e-commerce datasets, sentiment analysis accuracy exceeded 93% with an inference time of 22-23 ms, outperforming mainstream models. In cross-cultural and multi-category tests, the average accuracy was 91.5%, demonstrating robustness. The proposed model enhances visual-textual collaboration, offering an efficient solution for intelligent product analysis and user experience optimisation in e-commerce.
Chunsheng Zhang studied this question.