What question did this study set out to answer?

The research aims to investigate the limitations and vulnerabilities of LLMs in making predictions on tabular data, especially concerning task-irrelevant changes.

June 3, 2026 · Open Access

Robustness is important: Limitations of LLMs for predictions on tabular data

Key points

Examined prediction errors in LLMs due to irrelevant changes in data representation, such as variable name alterations.
Analyzed attention scores of LLMs to identify patterns in how predictions are influenced by input positions.
Compared general-purpose LLMs with state-of-the-art tabular foundation models to evaluate prediction performance.
LLMs show up to 82% variation in prediction error due to task-irrelevant modifications.
A non-uniform attention pattern was observed, with certain data positions receiving disproportionately high focus during output generation.
State-of-the-art tabular models demonstrated improved performance, but still exhibited sensitivity to irrelevant variations.
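
To make the idea of a task-irrelevant change concrete, the sketch below serializes the same table rows into prompt text under two different sets of variable names, and computes how far prediction error spreads across such variants. It is a minimal illustration, not the study's code: the `name = value` serialization, the helper names, the use of mean absolute error, and the spread formula `(max - min) / min` are all assumptions, and the example rows are made up.

```python
from typing import Dict, Sequence

Row = Dict[str, float]


def serialize(rows: Sequence[Row], rename: Dict[str, str]) -> str:
    """Render rows as 'name = value' text, renaming columns via `rename`.

    Hypothetical serialization format; columns absent from `rename`
    keep their original names. The values themselves never change.
    """
    lines = []
    for row in rows:
        cells = [f"{rename.get(col, col)} = {val}" for col, val in row.items()]
        lines.append(", ".join(cells))
    return "\n".join(lines)


def mean_abs_error(preds: Sequence[float], targets: Sequence[float]) -> float:
    """Mean absolute error between predictions and targets."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equal length")
    return sum(abs(p - t) for p, t in zip(preds, targets)) / len(preds)


def relative_error_spread(errors: Sequence[float]) -> float:
    """Spread of errors across prompt variants, as (max - min) / min.

    Assumed definition for illustration; the study may measure
    variation differently.
    """
    lo, hi = min(errors), max(errors)
    if lo <= 0:
        raise ValueError("smallest error must be positive")
    return (hi - lo) / lo


if __name__ == "__main__":
    # Made-up rows; the two variants differ only in variable names.
    rows = [{"age": 34.0, "income": 52000.0}]
    variants = {"original": {}, "generic": {"age": "x1", "income": "x2"}}
    for name, rename in variants.items():
        print(f"[{name}]")
        print(serialize(rows, rename))
```

Each variant would then be sent to the same model, one error computed per variant, and the resulting list passed to `relative_error_spread`. A robust predictor should give a spread near zero, since the variants carry identical information.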

Abstract

Large Language Models (LLMs) are being applied in a wide array of settings, well beyond typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for generating predictions on tabular data. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, perform comparably with many tabular supervised learning techniques. However, we identify a critical vulnerability of using LLMs for tabular prediction: making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both closed-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of two open-weight LLMs, we discover a non-uniform attention pattern: training examples and variable names/values occupying certain positions in the prompt receive more attention when generating output tokens, even though fundamentally there should not be different emphasis a priori on data rows or columns in specific positions. This partially explains the sensitivity due to task-irrelevant variations. We also consider several state-of-the-art tabular foundation models trained specifically for tabular prediction. They achieve better prediction performance than general-purpose LLMs but are still not immune to task-irrelevant variations. Overall, LLMs (especially general-purpose models) currently lack a basic level of robustness to be used as a principled prediction tool.
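
The positional attention finding can be pictured with a small sketch that measures how much attention mass each part of the prompt receives. This is an illustration under stated assumptions, not the authors' analysis: it assumes an attention matrix is already available (for example, averaged over heads and layers of an open-weight model), and the function name, segment labels, and toy numbers are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def attention_share_by_segment(
    attention: Sequence[Sequence[float]],
    segments: Dict[str, Tuple[int, int]],
) -> Dict[str, float]:
    """Fraction of attention mass landing on each labeled prompt segment.

    attention[i][j] is the weight output token i places on prompt token j.
    segments maps a label (e.g. one in-context example) to a half-open
    token range (start, end). Shares are normalized over labeled segments.
    """
    mass = {
        label: sum(sum(row[start:end]) for row in attention)
        for label, (start, end) in segments.items()
    }
    total = sum(mass.values())
    if total <= 0:
        raise ValueError("no attention mass falls on the labeled segments")
    return {label: m / total for label, m in mass.items()}


if __name__ == "__main__":
    # Toy matrix: 2 output tokens attending over 3 prompt tokens.
    toy_attention = [[0.5, 0.25, 0.25], [0.5, 0.25, 0.25]]
    toy_segments = {"example_1": (0, 1), "example_2": (1, 3)}
    print(attention_share_by_segment(toy_attention, toy_segments))
```

If position did not matter, equal-length segments would receive roughly equal shares. In the toy input, `example_1` spans one token and `example_2` spans two, yet both receive the same share, which is the kind of position-dependent imbalance the abstract describes.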

Cite This Study

Liu et al. studied this question.

synapsesocial.com/papers/6a1fc616dee9eb8c0dce750f https://doi.org/10.1093/pnasnexus/pgag197
