Abstract

Large Language Models (LLMs) are being applied in a wide array of settings, well beyond typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for generating predictions on tabular data. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, perform comparably to many tabular supervised learning techniques. However, we identify a critical vulnerability of using LLMs for tabular prediction: changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs' predictions on the same data. For example, simply changing variable names can change the size of the prediction error by as much as 82% in certain settings. Such prediction sensitivity to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both closed-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of two open-weight LLMs, we discover a non-uniform attention pattern: training examples and variable names/values occupying certain positions in the prompt receive more attention when generating output tokens, even though a priori there is no reason to emphasize data rows or columns in specific positions. This partially explains the sensitivity to task-irrelevant variations. We also consider several state-of-the-art tabular foundation models trained specifically for tabular prediction. They achieve better prediction performance than general-purpose LLMs but are still not immune to task-irrelevant variations. Overall, LLMs (especially general-purpose models) currently lack the basic level of robustness needed to serve as a principled prediction tool.
Liu et al. studied this question.
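To make the notion of a task-irrelevant variation concrete, the following is a minimal Python sketch of how one tabular in-context prompt could be rendered in several equivalent ways. It is not the authors' code. The `name = value` serialization format and all function names (`serialize`, `rename_columns`, `build_prompt`, `prompt_variants`) are assumptions made for illustration only.

```python
import random
from typing import Dict, List, Sequence

Row = Dict[str, float]


def serialize(row: Row, columns: Sequence[str]) -> str:
    """Render one row as 'name = value' pairs in the given column order."""
    return ", ".join(f"{c} = {row[c]}" for c in columns)


def rename_columns(rows: List[Row], mapping: Dict[str, str]) -> List[Row]:
    """Return copies of the rows with column names replaced; values are unchanged."""
    return [{mapping.get(k, k): v for k, v in r.items()} for r in rows]


def build_prompt(train: List[Row], query: Row,
                 features: Sequence[str], target: str) -> str:
    """Hypothetical in-context prompt: labelled examples, then an unlabelled query."""
    lines = [f"{serialize(r, features)} -> {target} = {r[target]}" for r in train]
    lines.append(f"{serialize(query, features)} -> {target} = ")
    return "\n".join(lines)


def prompt_variants(train: List[Row], query: Row, features: Sequence[str],
                    target: str, mapping: Dict[str, str],
                    seed: int = 0) -> Dict[str, str]:
    """Build prompts that carry the same data but differ in representation."""
    rng = random.Random(seed)
    shuffled = list(features)
    rng.shuffle(shuffled)  # permute column order only; the data is untouched

    renamed_features = [mapping.get(f, f) for f in features]
    renamed_target = mapping.get(target, target)
    return {
        "original": build_prompt(train, query, features, target),
        "columns_shuffled": build_prompt(train, query, shuffled, target),
        "names_changed": build_prompt(
            rename_columns(train, mapping),
            rename_columns([query], mapping)[0],
            renamed_features,
            renamed_target,
        ),
    }


train = [{"age": 30, "income": 50, "y": 1.0}]
query = {"age": 40, "income": 60}
variants = prompt_variants(train, query, ["age", "income"], "y",
                           {"age": "x1", "income": "x2"})
```

Each entry in `variants` encodes the same rows and values, so a robust predictor should return the same prediction for all of them. Sending each prompt to a model and comparing the resulting prediction errors is one way to probe the sensitivity described in the abstract.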