What question did this study set out to answer?

The aim is to create a comprehensive framework for evaluating large language models, focusing on behavioral integrity under challenging conditions.

June 2, 2026 | Open Access

Xyraiq™: A Multi-Tier Behavioral Stress Benchmark for Large Language Model Evaluation with Hashable Audit Provenance

Key Points

Set out to create a comprehensive framework for evaluating large language models, focused on behavioral integrity under challenging conditions.
Introduced a modular Stress-Tier Engine to escalate prompt complexity.
Developed a Session Orchestrator for maintaining multi-turn context integrity during evaluations.
Implemented Persona Packs to identify failure modes across diverse scenarios.
Demonstrated enhanced detection of hallucination drift and ethical boundary erosion in multi-turn conditions.
Established tamper-evident audit provenance through the Stress Signature, aligned with EU AI Act auditability requirements (Art. 53(1)).
Released the benchmark as an open core under Apache 2.0 to promote widespread use and enhancement.

Abstract

Existing large language model (LLM) evaluation frameworks address factual accuracy and single-turn instruction-following but lack systematic methods for assessing behavioral integrity under sustained adversarial pressure, multi-turn memory coherence, and reasoning fidelity across ethically and symbolically complex domains. This work introduces Xyraiq™, an open benchmark architecture that addresses these gaps through three integrated contributions. First, a modular Stress-Tier Engine escalates prompt complexity across defined behavioral thresholds, paired with a Session Orchestrator that maintains multi-turn context integrity across extended evaluation windows. Second, role-modular Persona Packs spanning professional, legal, executive, and spiritual-reasoning domains expose failure modes invisible to single-persona or single-turn regimes, detected by a Failure-Mode Detector layer tracking hallucination drift, ethical boundary erosion, and symbolic reasoning degradation across session state. Third, the Stress Signature — a hash-based cryptographic fingerprint encoding stress tier, persona context, failure events, and scoring outcomes per session — provides tamper-evident, institution-independent audit provenance not present in any current public benchmark. Xyraiq™ aligns with EU AI Act auditability requirements (Art. 53(1)) and is designed as governance infrastructure for behavioral accountability in high-stakes deployed AI systems. The benchmark is released as an open core under Apache 2.0, with premium persona suites licensed separately. This record establishes a dated public origination claim for the architecture, Stress Signature methodology, and IP claim structure at pre-provisional filing stage.
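
The abstract describes the Stress Signature only at a high level, so the following is a minimal sketch of how a hash-based session fingerprint of this kind could be computed, not the Xyraiq™ implementation. The record layout, the field names, the choice of SHA-256, and the canonical JSON encoding are all assumptions made for illustration; the source states only that the signature is hash-based and encodes stress tier, persona context, failure events, and scoring outcomes per session.

```python
import hashlib
import hmac
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class SessionRecord:
    """Hypothetical per-session record; field names are illustrative assumptions."""
    stress_tier: int                      # escalation level reached in the session
    persona: str                          # persona pack used, e.g. "legal"
    failure_events: tuple = ()            # detected failure modes, in order
    scores: dict = field(default_factory=dict)  # scoring outcomes by metric name


def stress_signature(record: SessionRecord) -> str:
    """Return a SHA-256 hex digest over a canonical JSON encoding of the record.

    Assumption: sorted keys and fixed separators make the encoding deterministic,
    so logically identical records always produce the same digest.
    """
    canonical = json.dumps(asdict(record), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(record: SessionRecord, expected_signature: str) -> bool:
    """Recompute the digest and compare it to a previously published one."""
    return hmac.compare_digest(stress_signature(record), expected_signature)


# Usage: fingerprint a session, then show that altering a score is detected.
original = SessionRecord(
    stress_tier=3,
    persona="legal",
    failure_events=("hallucination_drift",),
    scores={"integrity": 0.82},
)
signature = stress_signature(original)
tampered = replace(original, scores={"integrity": 0.99})

print(verify(original, signature))   # True
print(verify(tampered, signature))   # False
```

In this sketch, any third party holding the session record can recompute the digest without relying on the institution that ran the evaluation. A bare hash is tamper-evident only if the digest itself is stored or published somewhere the record holder cannot alter; a production design would likely add a keyed MAC or a digital signature, which the abstract does not specify.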
