What question did this study set out to answer?

This survey aims to identify and categorize failure modes of autoregressive large language models when used in industrial workflows.

February 28, 2026 | Open Access

Beyond Next-Token Prediction: A Standards-Aligned Survey of Autoregressive LLM Failure Modes, Deployment Patterns, and the Potential Role of World Models

Key Points

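Synthesized prior work into a control-oriented taxonomy of four autoregressive (AR) failure modes.
Cataloged standards-compatible deployment patterns that mitigate these failure modes.
Proposed an operational decision framework for deciding when token-centric mitigations are sufficient and when state/world-model components become warranted.
Identified compounding error, myopic objectives, data brittleness/hallucinations, and scaling/latency inefficiencies as recurring failure modes.
Outlined human-gated LLM-in-the-loop, retrieval + verification pipelines, planner-of-record architectures, and runtime assurance envelopes as mitigating deployment patterns.
Developed an evidence map linking each deployment pattern to peer-reviewed empirical findings and system reports.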

Abstract

This paper is a focused, standards-aligned survey of where autoregressive (AR) large language models (LLMs) tend to break down when deployed inside industrial informatics workflows that must satisfy long-horizon objectives, hard constraints, traceability, and functional-safety obligations (e.g., IEC 61508/ISO 26262/ISO 21448). Rather than claiming new algorithms or experiments, we synthesize and organize prior work into (i) a control-oriented taxonomy of four AR failure modes that recur in practice (compounding error, myopic objectives, data brittleness/hallucinations, and scaling/latency inefficiencies), (ii) a catalog of standards-compatible deployment patterns that mitigate these issues (human-gated LLM-in-the-loop, retrieval + verification pipelines, planner-of-record architectures, and runtime assurance envelopes), and (iii) an operational decision framework (criteria table with observable proxies, a stepwise decision procedure, and worked examples) for deciding when token-centric mitigations are sufficient versus when state/world-model components become warranted. Joint Embedding Predictive Architectures (JEPA) and Hierarchical JEPA (H-JEPA) are proposed as representative state-predictive architectures, with the discussion bounded by currently available empirical evidence; we explicitly note that the published evidence base is concentrated on vision/multimodal benchmarks and that industrial control validation remains limited. To make evidence boundaries transparent, we introduce (a) a survey method (scope, inclusion/exclusion criteria, and data-extraction fields), (b) a comparison matrix across representative prior systems, and (c) an evidence map that links each deployment pattern to peer-reviewed empirical findings and system reports.
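As a rough illustration of the first failure mode named above, compounding error, the following sketch shows how quickly the chance of an error-free long-horizon output can shrink. It rests on a deliberately simplified assumption that is not taken from the survey: each generated step fails independently with the same fixed probability. The function name `sequence_success_probability` and the example error rate are hypothetical and chosen only for illustration.

```python
def sequence_success_probability(per_step_error: float, steps: int) -> float:
    """Probability that all `steps` steps are error-free.

    Assumption (illustrative only): errors are independent and identically
    distributed across steps, so the probability is (1 - e) ** T.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must lie in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


if __name__ == "__main__":
    # Hypothetical 1% per-step error rate over increasingly long horizons.
    for horizon in (10, 100, 1000):
        p = sequence_success_probability(0.01, horizon)
        print(f"horizon={horizon:5d}  P(no error)={p:.6f}")
```

Under this toy model the success probability decays geometrically with horizon length, which is one way to see why long-horizon industrial tasks may call for verification or gating rather than unchecked generation. Real systems can violate the independence assumption in either direction, so the numbers are indicative at best.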

Authors

Lorenzo Ricciardi Celsi

James McCann

Journals

Electronics

Institutions

Broad Institute

Mercatorum University
