As enterprises migrate operational systems from rule-based classifiers to large language model (LLM)-powered agents, the structure and statistical behavior of the resulting operational data change in ways that standard model evaluation frameworks do not capture. Intent taxonomies shift, confidence distributions degrade, fallback rates spike, and response latency increases, yet most evaluation approaches focus on individual response quality rather than on the reliability of the underlying data for downstream analytics and business intelligence (BI) reporting. This paper presents a drift detection framework for monitoring AI-generated operational data in enterprise systems. The framework evaluates four reliability dimensions: intent taxonomy drift (Jensen-Shannon divergence), confidence score distribution shift (Kolmogorov-Smirnov test), fallback and latency instability, and BI readiness. A weighted aggregator combines these dimensions into a reliability score from 0 to 100 and a three-tier verdict (Ready, Caution, Not Ready). An LLM interpretation layer generates operational narratives for data engineering teams. The framework is evaluated across controlled drift scenarios on synthetic enterprise agent logs and validated on the public BANKING77 dataset (10,003 records, 77 intents). Results show monotonically increasing drift signals across all metrics as injected drift increases from 0% to 50%, with framework verdicts correctly escalating from Ready to Not Ready. All code and data are publicly available.
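To make the pipeline concrete, the following is a minimal illustrative sketch of the two named statistics and of a weighted aggregator, written with the Python standard library only. It is not the released implementation: the function names, the weights in `ASSUMED_WEIGHTS`, the verdict cut-offs of 80 and 50, and the convention of expressing each dimension as a penalty in [0, 1] are all assumptions introduced here for illustration.

```python
import math
from collections import Counter


def js_divergence(baseline_intents, current_intents):
    """Jensen-Shannon divergence (log base 2, so in [0, 1]) between two intent samples."""
    labels = sorted(set(baseline_intents) | set(current_intents))
    p_counts, q_counts = Counter(baseline_intents), Counter(current_intents)
    p = [p_counts[label] / len(baseline_intents) for label in labels]
    q = [q_counts[label] / len(current_intents) for label in labels]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def ks_statistic(baseline_scores, current_scores):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(baseline_scores), sorted(current_scores)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Hypothetical weights (sum to 1); the source does not state the actual values.
ASSUMED_WEIGHTS = {"intent": 0.3, "confidence": 0.3, "stability": 0.2, "bi": 0.2}


def reliability_score(penalties, weights=ASSUMED_WEIGHTS):
    """Map per-dimension penalties in [0, 1] to a 0-100 score and a verdict."""
    total = sum(weights[k] * min(max(penalties[k], 0.0), 1.0) for k in weights)
    score = 100.0 * (1.0 - total)
    # Hypothetical verdict thresholds, chosen only for this sketch.
    if score >= 80:
        verdict = "Ready"
    elif score >= 50:
        verdict = "Caution"
    else:
        verdict = "Not Ready"
    return score, verdict
```

In this sketch, a baseline window and a current window of agent logs would be compared by passing their intent labels to `js_divergence` and their confidence scores to `ks_statistic`. The resulting values, together with penalties for fallback and latency instability and for BI readiness, would then be supplied to `reliability_score`. Only the KS statistic is computed here; a p-value would require an additional step.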
Ritika De (Sat,) studied this question.