What question did this study set out to answer?

This research aims to assess the effectiveness of four AI classifiers in detecting Shadow AI in enterprise environments.

May 26, 2026 | Open Access

Detecting Shadow AI in the Enterprise: A Four-Model Empirical Study

Key Points

This research aims to assess the effectiveness of four AI classifiers in detecting Shadow AI in enterprise environments.
Evaluated four classifiers: TF-IDF logistic regression, sentence-transformer encoder, bidirectional GRU, and zero-shot Llama-3.1-8B-Instruct judge.
Assembled an 800-prompt corpus balanced at 200 prompts per class across four classes.
Collected 13,239 cleaned records across four commercial inference providers and split them with a stratified, group-aware fold partition.
On the held-out fold, the TF-IDF baseline achieved 95.6% accuracy and a Shadow F1 of 0.96.
The sentence-transformer reached 95.0% accuracy.
The BiGRU averaged 91.1% ± 8.1% accuracy across five seeds with occasional collapse, while the zero-shot judge plateaued at 68.8% accuracy.
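
The two headline metrics above, accuracy and Shadow F1, can be illustrated with a short sketch. It assumes that "Shadow F1" means the one-vs-rest F1 score on the Shadow AI class, which the summary does not spell out. The function names and toy labels are hypothetical and are not the study's code or data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the reference label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """One-vs-rest F1 for a single class (assumed meaning of 'Shadow F1')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration only; they are not the study's data.
y_true = ["Shadow AI", "Shadow AI", "Benign", "Legitimate AI",
          "Automated AI Pipelines", "Shadow AI"]
y_pred = ["Shadow AI", "Benign", "Benign", "Legitimate AI",
          "Automated AI Pipelines", "Shadow AI"]

print(f"accuracy  = {accuracy(y_true, y_pred):.3f}")
print(f"Shadow F1 = {per_class_f1(y_true, y_pred, 'Shadow AI'):.3f}")
```

Reporting the per-class score next to overall accuracy matters here because a classifier could score well overall while still missing the one class the gateway cares about most.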

Abstract

Shadow AI, the use of large language models outside sanctioned enterprise channels, has become a material security and compliance concern. This study evaluates four prompt-only classifiers for routing enterprise LLM traffic into a four-class taxonomy (Benign, Legitimate AI, Shadow AI, and Automated AI Pipelines): a TF-IDF logistic-regression baseline, a sentence-transformer encoder with a linear head, a from-scratch bidirectional GRU, and a zero-shot Llama-3.1-8B-Instruct judge. The 800-prompt corpus, balanced at 200 per class, was assembled in a two-stage workflow in which an LLM-generated draft was revised by Claude Opus 4.7 to reduce surface-template separability; collection across four commercial inference providers yielded 13,239 cleaned records, which were split with a stratified group-aware fold partition. On the held-out fold, the TF-IDF baseline reaches 95.6% accuracy and a 0.96 Shadow F1, the sentence-transformer closely follows at 95.0%, the BiGRU averages 91.1% ± 8.1% across five seeds with occasional collapse, and the zero-shot judge plateaus at 68.8%. The data indicate that lightweight lexical models on prompt text suffice for high-accuracy Shadow AI detection at the gateway, while heavier and zero-shot alternatives offer no consistent benefit.
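
As a rough illustration of what a stratified group-aware fold partition can look like, the sketch below keeps all records that share a group on the same side of the split and spreads each class's groups evenly across folds. It assumes the grouping key is the source prompt (so that responses to one prompt from different providers never straddle the split) and that five folds are used. Neither detail is stated in the abstract, and `group_stratified_folds` is a hypothetical helper, not the study's implementation.

```python
import random
from collections import defaultdict


def group_stratified_folds(records, n_folds=5, seed=0):
    """Map each group id to a fold index.

    All records sharing a group id land in the same fold, and the groups
    of each class are dealt round-robin across folds after a seeded shuffle.
    `records` is a list of (group_id, label) pairs; each group is assumed
    to carry exactly one label.
    """
    groups_by_label = defaultdict(set)
    for group_id, label in records:
        groups_by_label[label].add(group_id)

    rng = random.Random(seed)
    fold_of = {}
    for label in sorted(groups_by_label):
        groups = sorted(groups_by_label[label])  # sort for reproducibility
        rng.shuffle(groups)
        for i, group_id in enumerate(groups):
            fold_of[group_id] = i % n_folds
    return fold_of


# Toy data: 4 classes x 10 hypothetical prompt ids x 3 records per prompt.
LABELS = ["Benign", "Legitimate AI", "Shadow AI", "Automated AI Pipelines"]
records = [
    (f"{label}-{k}", label)
    for label in LABELS
    for k in range(10)
    for _ in range(3)
]

fold_of = group_stratified_folds(records, n_folds=5, seed=0)
held_out = [r for r in records if fold_of[r[0]] == 0]
train = [r for r in records if fold_of[r[0]] != 0]
print(len(train), len(held_out))
```

Splitting by group rather than by record guards against leakage: if near-identical records derived from one prompt appeared in both training and test data, held-out accuracy would be inflated.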
