Large Language Models (LLMs) have achieved extraordinary generative and reasoning capabilities, yet their internal decision‑making processes remain opaque. Most interpretability methods are post‑hoc, attempting to explain a model’s output only after the forward pass has completed. This reactive paradigm leaves a persistent alignment gap: models can be confidently wrong, logically inconsistent, or sycophantic without any internal mechanism for self‑correction. This work introduces the Interpretability‑Driven Reasoning Architecture (IDRA), a runtime system that transforms reasoning from an implicit by‑product of token prediction into an explicit, recursively updated state vector. IDRA implements the Recursive State Vector Engine (RSVE), a feedback‑driven mechanism that regulates confidence, alignment, compute allocation, and logical coherence during generation. The architecture incorporates a Logic Lattice for structural consistency, a Verification Gate for evidence‑gated confidence updates, and a Miscalibration Penalty that discourages unwarranted certainty. We evaluate IDRA using a ten‑test suite spanning ambiguity, factual contradiction, safety‑critical reasoning, philosophical analysis, and long‑term preference extraction. Empirical results from the IDRALOG dataset show that IDRA maintains calibrated confidence, triggers penalties for high‑confidence errors, enforces evidentiary constraints on confidence increases, and dynamically allocates compute to reduce logical entropy. These behaviors emerge consistently across tasks, suggesting that runtime interpretability architectures can meaningfully improve transparency, safety, and reliability in LLM reasoning.
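The abstract names a Verification Gate and a Miscalibration Penalty but does not specify either. As an illustration only, the following minimal sketch shows one way such mechanisms could behave. Everything in it is an assumption rather than the author's implementation: the `ReasoningState` fields, the rule that confidence may rise only when evidence is verified, and the squared-error (Brier-style) form of the penalty are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReasoningState:
    """Hypothetical explicit state carried between reasoning steps."""
    confidence: float      # current confidence, kept in [0, 1]
    penalty: float = 0.0   # accumulated miscalibration penalty


def gated_update(state, proposed_confidence, evidence_verified):
    """Assumed gate: confidence may rise only when evidence is verified.

    Decreases are always accepted. This rule is an illustrative guess,
    not something the abstract specifies.
    """
    proposed = min(max(proposed_confidence, 0.0), 1.0)  # clamp to [0, 1]
    if proposed > state.confidence and not evidence_verified:
        # Block the unsupported increase and keep the old confidence.
        return ReasoningState(state.confidence, state.penalty)
    return ReasoningState(proposed, state.penalty)


def miscalibration_penalty(confidence, was_correct):
    """Assumed squared-error penalty: largest for confident errors."""
    outcome = 1.0 if was_correct else 0.0
    return (confidence - outcome) ** 2


state = ReasoningState(confidence=0.5)
state = gated_update(state, 0.9, evidence_verified=False)  # increase blocked
state = gated_update(state, 0.9, evidence_verified=True)   # increase allowed
state.penalty += miscalibration_penalty(state.confidence, was_correct=False)
```

Under these assumptions, a wrong answer held at high confidence accrues a larger penalty than the same wrong answer held at low confidence, which matches the behavior the abstract attributes to the penalty.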
Elliot Monteverde
Elliot Monteverde (Mon,) studied this question.
www.synapsesocial.com/papers/698c1c46267fb587c655e946 — DOI: https://doi.org/10.5281/zenodo.18568162