What question did this study set out to answer?

The study aims to comprehensively evaluate how well large language models transform semi-structured data into natural language, judged on readability, informativeness, and faithfulness.

February 2, 2026

A Comprehensive Performance Evaluation of LLMs for Data-to-Text Generation and Divergence-Weighted Training

Key Points

Aimed to comprehensively evaluate how well large language models (LLMs) transform semi-structured data into natural language text.
Evaluated twelve LLMs from five open-source families across five data-to-text (D2T) datasets.
Used six established automatic metrics, complemented by human evaluation for deeper insight.
Conducted robustness analyses with different fine-tuning and decoding strategies.
Proposed divergence-weighted training to enhance model performance.
Larger model sizes generally improved readability and informativeness.
Llama 2 exhibited the best overall performance in D2T tasks.
Faithfulness did not consistently improve with larger models and sometimes degraded.
Human evaluations favored larger models for their overall quality despite minor faithfulness errors.
Performance declined with increased source-reference divergence across all models.

Abstract

Data-to-text generation (D2T) aims to transform semi-structured data (such as tables and graphs) into natural language text. Owing to their exceptional capabilities, large language models (LLMs) have become ubiquitous as foundational models for D2T. This paper presents a comprehensive evaluation of LLMs for D2T, focusing on three key qualities: readability (fluency and coherence), informativeness (content preservation), and faithfulness (factual accuracy). We evaluate twelve LLMs from five prominent open-source families (BART, T5, BLOOM, OPT, and Llama 2) across five widely used D2T datasets using six established automatic metrics, complemented by human evaluation for deeper insight. Our findings reveal that larger model sizes generally improve readability and informativeness, with Llama 2 showing superior overall performance. However, increased model size does not consistently enhance faithfulness and may sometimes degrade it. Human evaluations indicate that readers generally prefer larger models for readability, informativeness, and faithfulness; their faithfulness errors tend to be minor, and automatic evaluation metrics assess them more selectively. Through robustness analyses, we confirm that these trends remain stable across different fine-tuning (QLoRA vs. Prefix-Tuning) and decoding (Beam Search vs. Nucleus Sampling) strategies. Furthermore, our experiments show that performance consistently declines as source-reference divergence increases, regardless of model size. To mitigate this, we propose source-reference divergence-weighted training, which adaptively reweights training instances based on their source-reference divergence and achieves consistent improvements across all three key evaluation qualities. This comprehensive study provides practical insights into LLM behavior in D2T and introduces an effective training paradigm for improving D2T performance.
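The abstract does not state how divergence is measured or how the weights are computed, so the following is only a minimal sketch of what instance reweighting by source-reference divergence could look like, not the authors' method. Everything in it is an assumption made for illustration: the token-overlap divergence measure, the exponential weighting, the `alpha` parameter, the choice to down-weight high-divergence instances, and all function names.

```python
import math


def token_divergence(source_tokens, reference_tokens):
    """Hypothetical divergence: share of reference tokens absent from the source.

    Returns 0.0 when every reference token appears in the source and 1.0
    when none do. The paper's actual divergence measure may differ.
    """
    if not reference_tokens:
        return 0.0
    source_vocab = set(source_tokens)
    unsupported = sum(1 for tok in reference_tokens if tok not in source_vocab)
    return unsupported / len(reference_tokens)


def divergence_weights(divergences, alpha=1.0):
    """Assumed weighting: exp(-alpha * divergence), rescaled to have mean 1.

    Higher divergence gives a lower weight; `alpha` controls how sharply.
    """
    if not divergences:
        return []
    raw = [math.exp(-alpha * d) for d in divergences]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_loss(per_example_losses, weights):
    """Average of per-example losses after applying the instance weights."""
    total = sum(loss * w for loss, w in zip(per_example_losses, weights))
    return total / len(per_example_losses)


# Toy batch: (linearized source tokens, reference tokens).
batch = [
    (["name", "Alice", "age", "30"], ["Alice", "is", "30"]),
    (["name", "Bob", "city", "Paris"], ["Bob", "loves", "long", "walks"]),
]
divergences = [token_divergence(src, ref) for src, ref in batch]
weights = divergence_weights(divergences, alpha=2.0)
```

In an actual fine-tuning loop the per-example losses would come from the model's token-level cross-entropy; rescaling the weights to mean 1 keeps the overall loss magnitude comparable to unweighted training.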
