March 3, 2026

Open Access

Multimodal Translation Method Using Summary-Level Image Utilization

Key Points

MVNMT mitigates over-translation more effectively than prior MNMT models that use images at the token level.
The model uses image information to model features of the whole sentence (the summary level) and integrates them into the decoder.
Using a variational autoencoder, MVNMT extracts a common latent representation from both text and images.
In the experiments, MVNMT outperforms conventional text-only translation models on translation evaluation metrics.

Abstract

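This paper proposes a method for using images at the summary level in multimodal neural machine translation. Conventional models typically extract and use only the image information related to the next token to be predicted, and we show that this can cause over-translation. To address this problem, we propose MVNMT, a new model that uses image information to model features of the whole sentence (the summary) and integrates them into the decoder. MVNMT uses a variational autoencoder to extract a common latent representation from text and image information. Our experimental results show that MVNMT outperforms conventional text-only translation models on translation evaluation metrics, and that it mitigates the over-translation problem more effectively than MNMT models that use images at the token level.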
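
The idea of a sentence-level latent variable drawn from both text and image can be illustrated with a minimal NumPy sketch. It is not the authors' architecture: the mean pooling, the single tanh layer, the standard normal prior, and every name in it (`init_params`, `summary_latent`, `W_h`, `W_mu`, `W_lv`) are assumptions made for illustration. It shows only the generic variational step: pool the encoder states into one summary vector, combine it with a global image feature, and sample a latent vector with the reparameterization trick.

```python
import numpy as np


def init_params(text_dim, image_dim, hidden_dim, latent_dim, rng, scale=0.1):
    """Create small random weights for the illustrative inference network (hypothetical layout)."""
    return {
        "W_h": scale * rng.standard_normal((hidden_dim, text_dim + image_dim)),
        "b_h": np.zeros(hidden_dim),
        "W_mu": scale * rng.standard_normal((latent_dim, hidden_dim)),
        "b_mu": np.zeros(latent_dim),
        "W_lv": scale * rng.standard_normal((latent_dim, hidden_dim)),
        "b_lv": np.zeros(latent_dim),
    }


def summary_latent(text_states, image_feature, params, rng):
    """Sample a sentence-level latent z from token states and one global image vector."""
    # Mean-pool over tokens so the image is combined with the whole sentence,
    # not with the representation of a single token (assumed pooling choice).
    text_summary = text_states.mean(axis=0)
    joint = np.concatenate([text_summary, image_feature])
    hidden = np.tanh(params["W_h"] @ joint + params["b_h"])

    # Parameters of a diagonal Gaussian posterior over z.
    mu = params["W_mu"] @ hidden + params["b_mu"]
    logvar = params["W_lv"] @ hidden + params["b_lv"]

    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * logvar) * eps

    # Closed-form KL divergence to a standard normal prior (an assumption here).
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return z, mu, logvar, kl


rng = np.random.default_rng(0)
params = init_params(text_dim=8, image_dim=16, hidden_dim=12, latent_dim=4, rng=rng)
text_states = rng.standard_normal((5, 8))   # 5 source tokens, 8-dim states
image_feature = rng.standard_normal(16)     # one global image vector
z, mu, logvar, kl = summary_latent(text_states, image_feature, params, rng)
print(z.shape, float(kl))
```

In a full model, a vector like `z` would be passed to the decoder at every step alongside the usual attention context, and the KL term would be added to the translation loss. How MVNMT does this in detail is described in the full paper, not here.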
