In enterprise data security governance, document AI has become a mission-critical component for preventing document leakage through automatic, accurate classification and identification of sensitive content. This highlights the need to bring document classification benchmarks closer to real-world engineering applications. This paper identifies the lack of public datasets for native multi-modal hybrid document classification and accordingly proposes the DocCLSNMMH (Native Multi-Modal Hybrid Document Classification) dataset along with its out-of-distribution (OOD) test subset. An experimental study on the proposed dataset demonstrates that current benchmarks have become irrelevant and need to be updated to evaluate native multi-modal hybrid documents. Accuracy degradation on heterogeneous documents and in few-shot scenarios is also assessed, as both are prevalent in practice. The experimental results show that LayoutLM achieves state-of-the-art (SOTA) performance with 98.66% accuracy on DocCLSNMMH, with only approximately 7% accuracy degradation on its OOD test subset, while training-free models (Qwen2.5-VL-32B and Gemma3-27B) consistently achieve over 95% accuracy across the full dataset. The SOTA performance of these models on our benchmark provides effective guidance for model selection in real engineering applications.
Wang et al. (Fri,) studied this question.