ABSTRACT
Large language models (LLMs) demonstrate strong performance in natural language tasks, but their capacity for genuine in‐context learning (ICL) in scientific regression remains unclear. We systematically assessed seven LLMs on molecular property prediction using a controlled framework of 56 transformed tasks that isolate shortcut learning and are designed to induce functional out‐of‐distribution (OOD) behavior. LLMs performed nearly perfectly on raw molecular weight prediction via shortcut cues but deteriorated under nonlinear transformations, whereas machine learning (ML) baselines showed greater robustness, yielding a performance crossover. Meta‐analysis revealed that distributional descriptors and structure–activity landscape indices (SALI) predict task favorability, providing a framework for selecting between LLM‐ and ML‐based approaches in chemistry.
Joe et al. (Thu,) studied this question.