This study proposes FinDocLLM, a multimodal large language model framework for financial document understanding that integrates chart, table, and textual information. Financial documents such as annual reports and earnings releases typically combine heterogeneous data modalities, yet existing approaches rely predominantly on unimodal text analysis and neglect critical information embedded in charts and tables. To address this gap, the study constructs a cross-modal financial dataset of 3,200 annotated document pages from publicly listed companies and develops a three-stage training pipeline comprising visual encoding, cross-modal alignment, and task-specific fine-tuning. Empirical results on three benchmark tasks (financial question answering, chart interpretation, and table reasoning) show that FinDocLLM improves accuracy over unimodal baselines by 15.3%, 18.7%, and 12.1%, respectively. Ablation experiments further confirm that each modality makes a complementary contribution. The study contributes to the growing literature on financial AI by providing a practical and effective approach to multimodal financial document analysis.
Shengxi Jin (Tue,) studied this question.