Large Language Models (LLMs) have demonstrated remarkable proficiency in writing and executing code, leading to the development of autonomous agentic loops for Machine Learning (ML) engineering. However, when deployed autonomously without human intervention, these models exhibit distinct behavioral failure modes reminiscent of human cognitive biases. In this paper, we deploy 13 different LLMs, spanning frontier, general-purpose, and code-specialized architectures, into the Autonomous Empirical Optimization System (AEOS), a zero-human sandbox designed to solve ML pipeline tasks autonomously. We introduce an "Extended-Horizon" experimentation framework in which agents are granted massive iteration limits and widened patience thresholds, testing their intrinsic ability to recognize performance plateaus and terminate on their own before system-forced intervention. Our findings reveal that both premium frontier models and general-purpose local models suffer from a severe "Autonomous Sunk-Cost Fallacy," trapping themselves in unproductive loops and wasting significant compute. Conversely, we demonstrate that modern, instruction-tuned code models possess superior meta-reasoning alignment, allowing them to accurately identify stagnation and gracefully terminate exploration.
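To make the "patience threshold" idea concrete, the following is a minimal sketch of a plateau monitor that separates a voluntary agent stop from a system-forced stop. It is an illustration under stated assumptions, not the AEOS implementation: the names `PlateauMonitor`, `run_loop`, `min_delta`, and `agent_stops_at`, as well as the outcome labels, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class PlateauMonitor:
    """Hypothetical patience counter; not taken from AEOS."""
    patience: int            # non-improving iterations tolerated before a forced stop
    min_delta: float = 0.0   # smallest gain that counts as an improvement
    best: float = float("-inf")
    stale: int = 0

    def update(self, score: float) -> bool:
        """Record a score and return True once patience is exhausted."""
        if score > self.best + self.min_delta:
            self.best = score
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def run_loop(scores, patience, max_iters, agent_stops_at=None):
    """Replay a score trace and report how the run ended.

    agent_stops_at is an assumed stand-in for the iteration at which the
    agent itself decides to stop; None means it never stops voluntarily.
    """
    monitor = PlateauMonitor(patience=patience)
    trace = scores[:max_iters]
    for i, score in enumerate(trace, start=1):
        if monitor.update(score):
            return ("forced_stop", i)        # the system intervened first
        if agent_stops_at is not None and i >= agent_stops_at:
            return ("agent_terminated", i)   # the agent recognized the plateau
    return ("budget_exhausted", len(trace))


# A trace that plateaus at 0.76 after three improving iterations.
plateau = [0.70, 0.75, 0.76, 0.76, 0.76, 0.76, 0.76, 0.76]
```

In this sketch, widening `patience` delays the forced stop, so an agent that never stops on its own spends more iterations on the plateau before the system intervenes.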
Sanskar jajoo
HEM Technologies (United States)
Sanskar jajoo (Tue,) studied this question.
synapsesocial.com/papers/69f2a4da8c0f03fd67763fe5 — DOI: https://doi.org/10.5281/zenodo.19846959