Data-to-text generation (D2T) aims to transform semi-structured data (such as tables and graphs) into natural language text. Owing to their exceptional capabilities, large language models (LLMs) have become ubiquitous as foundation models for D2T. This paper presents a comprehensive evaluation of LLMs for D2T, focusing on three key qualities: readability (fluency and coherence), informativeness (content preservation), and faithfulness (factual accuracy). We evaluate twelve LLMs from five prominent open-source families (BART, T5, BLOOM, OPT, and Llama 2) on five widely used D2T datasets using six established automatic metrics, complemented by human evaluation for deeper insight. Our findings reveal that larger models generally improve readability and informativeness, with Llama 2 showing the best overall performance. However, increasing model size does not consistently enhance faithfulness and sometimes degrades it. Human evaluation indicates that readers generally prefer larger models for their readability, informativeness, and faithfulness, suggesting that automatic metrics are more sensitive than human readers to the minor faithfulness errors these models make. Through robustness analyses, we confirm that these trends remain stable across fine-tuning strategies (QLoRA vs. Prefix-Tuning) and decoding strategies (Beam Search vs. Nucleus Sampling). Furthermore, our experiments show that performance consistently declines as source-reference divergence increases, regardless of model size. To mitigate this, we propose source-reference divergence-weighted training, which adaptively reweights training instances according to their source-reference divergence and achieves consistent improvements across all three qualities. This study provides practical insights into LLM behavior in D2T and introduces an effective training paradigm for improving D2T performance.
Mahapatra et al. (Fri,) studied this question.
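To make the idea of divergence-weighted training concrete, the following is a minimal sketch and not the paper's method: the abstract does not specify how divergence is measured or how weights are derived from it. The sketch assumes a simple token-overlap divergence (the fraction of reference tokens absent from the source) and an exponential down-weighting of highly divergent instances. The function names, the `alpha` parameter, and the direction of the weighting are all illustrative assumptions.

```python
import math
from typing import List, Sequence


def token_divergence(source_tokens: Sequence[str],
                     reference_tokens: Sequence[str]) -> float:
    """Hypothetical divergence: share of reference tokens not found in the source.

    Returns 0.0 when every reference token appears in the source and 1.0 when
    none do. This is an assumed proxy, not the measure used in the paper.
    """
    if not reference_tokens:
        return 0.0
    source_vocab = {tok.lower() for tok in source_tokens}
    missing = sum(1 for tok in reference_tokens if tok.lower() not in source_vocab)
    return missing / len(reference_tokens)


def divergence_weights(divergences: Sequence[float],
                       alpha: float = 1.0) -> List[float]:
    """Map divergences to per-instance weights, normalised to a mean of 1.

    Assumption: more divergent instances get smaller weights, w = exp(-alpha * d).
    """
    raw = [math.exp(-alpha * d) for d in divergences]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_loss(per_instance_losses: Sequence[float],
                  weights: Sequence[float]) -> float:
    """Average of per-instance losses after applying the instance weights."""
    total = sum(loss * w for loss, w in zip(per_instance_losses, weights))
    return total / len(per_instance_losses)


if __name__ == "__main__":
    source = ["name", "Alice", "age", "30"]
    reference = ["Alice", "is", "30"]
    d = token_divergence(source, reference)
    w = divergence_weights([0.0, d, 1.0], alpha=1.0)
    print(d, w, weighted_loss([2.0, 2.0, 2.0], w))
```

In an actual fine-tuning loop, the per-instance losses would come from the model (for example, token-level cross-entropy averaged per sequence), and the weights would be computed once per batch or precomputed over the training set. Normalising the weights to a mean of 1 keeps the overall loss scale comparable to unweighted training.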