To improve the multimodal consistency and semantic similarity of product image editing results, a text-guided product image editing method based on multimodal feature fusion is proposed. First, three kinds of image features are extracted: shape features via Hu moments, texture features via a grey-level co-occurrence matrix, and edge features via the Canny algorithm. Second, these shape, texture, and edge features are fused with the target text information through a dual attention mechanism, achieving multimodal feature fusion. Finally, a generative adversarial network model combines the feature fusion results for the target text with the existing image to perform text-guided product image editing. Experimental results show that the proposed method achieves a multimodal consistency coefficient of 0.98 and a visual semantic similarity of 0.990.
Mao et al. studied this question.
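The feature-extraction step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`hu_first_two`, `glcm`, `glcm_contrast`) are hypothetical, the image is assumed to be a small grayscale grid given as a list of rows, and only the first two of the seven Hu moments and a single co-occurrence offset are computed. Canny edge detection is omitted for brevity. In practice these descriptors are usually taken from a library such as OpenCV (`cv2.HuMoments`, `cv2.Canny`) or scikit-image (`graycomatrix`).

```python
# Illustrative sketch only: pure-Python versions of two of the hand-crafted
# descriptors named in the abstract. All function names are hypothetical.

def hu_first_two(img):
    """Return the first two Hu invariant moments of a grayscale image.

    `img` is a sequence of rows of non-negative intensities.
    """
    m00 = sum(sum(row) for row in img)
    if m00 == 0:
        raise ValueError("image has zero total intensity")
    # Intensity-weighted centroid.
    cx = sum(x * v for row in img for x, v in enumerate(row)) / m00
    cy = sum(y * v for y, row in enumerate(img) for v in row) / m00

    def eta(p, q):
        # Normalised central moment: mu_pq / mu_00^(1 + (p + q) / 2).
        mu = sum((x - cx) ** p * (y - cy) ** q * v
                 for y, row in enumerate(img)
                 for x, v in enumerate(row))
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    h1 = n20 + n02
    h2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    return h1, h2


def glcm(img, levels, dx=1, dy=0):
    """Normalised grey-level co-occurrence matrix for the offset (dx, dy).

    Pixel values must be integers in the range [0, levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(img), len(img[0])
    total = 0
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                counts[img[y][x]][img[ny][nx]] += 1
                total += 1
    if total == 0:
        raise ValueError("offset leaves no valid pixel pairs")
    return [[c / total for c in row] for row in counts]


def glcm_contrast(p):
    """Contrast texture statistic: sum over (i, j) of (i - j)^2 * p[i][j]."""
    return sum((i - j) ** 2 * v
               for i, row in enumerate(p)
               for j, v in enumerate(row))
```

The Hu moments are invariant to translation, scale, and rotation, which is why they suit shape description, while the co-occurrence contrast grows with the intensity difference between neighbouring pixels. How the paper combines these descriptors into feature vectors for the dual attention mechanism is not specified here, so the sketch stops at the individual descriptors.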