What type of study is this?

This is a Quantitative Study study.

September 17, 2025Open Access

Empirical Evaluation of Reasoning LLMs in Machinery Functional Safety Risk Assessment and the Limits of Anthropomorphized Reasoning

Key Points

Rule-grounded prompting consistently stabilized performance, achieving ceiling-level accuracy in estimating Required Performance Level (PLr).
Only rule-constrained prompts reliably captured rare but high-risk hazards, highlighting the importance of deterministic classification for safety-critical tasks.
CoT strategies incurred significant latency overhead, demonstrating that unconstrained reasoning may destabilize outcomes in risk assessment contexts.
Intermediate reasoning steps diverged from ISO-consistent logic, stressing that LLM 'reasoning' does not equate to genuine inferential capabilities.

Abstract

Transparent reasoning and interpretability are essential for AI-supported risk assessment, yet it remains unclear whether large language models (LLMs) can provide reliable, deterministic support for safety-critical tasks or merely simulate reasoning through plausible outputs. This study presents a systematic, multi-model empirical evaluation of reasoning-capable LLMs applied to machinery functional safety, focusing on Required Performance Level (PLr) estimation as defined by ISO 13849-1 and ISO 12100. Six state-of-the-art models (Claude-opus, o3-mini, o4-mini, GPT-5-mini, Gemini-2.5-flash, DeepSeek-Reasoner) were evaluated across six prompting strategies and two dataset variants: canonical ISO-style hazards (Variant 1) and engineer-authored free-text scenarios (Variant 2). Results show that rule-grounded prompting consistently stabilizes performance, achieving ceiling-level accuracy in Variant 1 and restoring reliability under lexical variability in Variant 2. In contrast, unconstrained chain-of-thought reasoning (CoT) and CoT together with Retrieval-Augmented Generation (RAG) introduce volatility, overprediction biases, and model-dependent degradations. Safety-critical coverage was quantified through per-class F1 and recall of PLr class e, confirming that only rule-grounded prompts reliably captured rare but high-risk hazards. Latency analysis demonstrated that rule-only prompts were both the most accurate and the most efficient, while CoT strategies incurred 2–10× overhead. A confusion/rescue analysis of retrieval interactions further revealed systematic noise mechanisms such as P-inflation and F-drift, showing that retrieval can either destabilize or rescue cases depending on model family. Intermediate severity/frequency/possibility (S/F/P) reasoning steps were found to diverge from ISO-consistent logic, reinforcing critiques that LLM “reasoning” reflects surface-level continuation rather than genuine inference. All reported figures include 95% confidence intervals, t-intervals across runs (r=5) for accuracy and timing, and class-stratified bootstrap CIs for Micro/Macro/Weighted-F1 and per-class metrics. Overall, this study establishes a rigorous benchmark for evaluating LLMs in functional safety workflows such as PLr determination. It shows that deterministic, safety-critical classification requires strict rule-constrained prompting and careful retrieval governance, rather than reliance on assumed model reasoning abilities.

Read Full Paperexternally

AI에게 질문

Bookmark

View Full Paper