October 10, 2022

LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Yupan Huang, Microsoft Research (United Kingdom); Tengchao Lv, Microsoft (United States); Lei Cui, Microsoft (United States)

Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. The code and models are publicly available at https://aka.ms/layoutlmv3.
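The word-patch alignment objective described in the abstract can be illustrated by how its binary targets might be built. The following is a minimal sketch, not the authors' implementation: the function name `word_patch_alignment_labels`, the flat word-to-patch index mapping, and the choice to skip words whose text is itself masked are all assumptions made for illustration.

```python
from typing import List, Optional, Set


def word_patch_alignment_labels(
    word_to_patch: List[int],
    masked_patches: Set[int],
    masked_words: Optional[Set[int]] = None,
) -> List[Optional[int]]:
    """Build one target per word for a word-patch alignment style objective.

    word_to_patch[i] is the index of the image patch covering word i
    (hypothetical layout; a real pipeline would derive it from bounding boxes).
    Returns 1 if the word's patch is masked, 0 if it is visible, and None
    for words that are skipped.
    """
    masked_words = masked_words or set()
    labels: List[Optional[int]] = []
    for word_idx, patch_idx in enumerate(word_to_patch):
        if word_idx in masked_words:
            # Assumption: words whose text is masked do not contribute a target.
            labels.append(None)
        else:
            labels.append(1 if patch_idx in masked_patches else 0)
    return labels


# Four words; words 0 and 1 share patch 0, patch 3 is masked, word 1 is masked.
example = word_patch_alignment_labels(
    word_to_patch=[0, 0, 3, 5],
    masked_patches={3},
    masked_words={1},
)
print(example)  # [0, None, 1, 0]
```

A classifier head trained on these targets would be scored only on the entries that are not None, so the model has to infer from the visible text and layout whether the matching image region was hidden.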

Cite This Study

Huang et al. (2022). LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking.

synapsesocial.com/papers/69d754aeaa68b335b4f30fd9 https://doi.org/10.1145/3503161.3548112
