Hallucinations hinder reliable question answering, especially in resource-constrained deployments where frontier-scale models or retrieval-heavy pipelines may be impractical. We present EdgeJury, a four-stage ensemble framework that improves truthfulness using only small instruction-tuned language models (3B–8B) suitable for serverless edge inference. EdgeJury combines role-specialized generation, anonymized cross-review, chairman synthesis, and claim-level agreement labeling. On TruthfulQA (MC1), EdgeJury reaches 76.2% accuracy (95% CI: 72.8–79.6%), a +21.4% relative improvement over a single 8B baseline, and outperforms self-consistency and majority voting under transparent compute accounting. On a 200-question adversarial EdgeCases set, it achieves a +48.2% relative gain, and manual error analysis shows an approximately 55% reduction in factual hallucination errors versus the single-model baseline. Beyond end-to-end gains, we analyze why the framework works and when it is worth its cost. Stage 1 agents exhibit non-trivial complementarity; blind synthesis preserves most of the full system’s accuracy while reducing chairman self-preference; structured critique outperforms open-ended critique in orchestration reliability; and scaling/pruning experiments show that most of the gain is obtained with four agents, with substantial latency recovery available through early exit. Deployed on Cloudflare Workers AI, EdgeJury achieves 8.4 s median end-to-end latency, showing that coordinated small-model ensembles can improve truthfulness on misconception-heavy QA benchmarks without external retrieval or proprietary large-model APIs.
Aayush Kumar (Thu,) studied this question.
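
To make the four stages named in the abstract concrete, the following is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not EdgeJury's actual implementation: the `Model` type, the `run_jury` function, the prompt strings, the substring-based claim matching, and the label names `consensus`, `majority`, and `contested` are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical abstraction: a "model" is any callable mapping a prompt to text.
Model = Callable[[str], str]


@dataclass
class JuryResult:
    answer: str                   # chairman's synthesized answer
    claim_labels: Dict[str, str]  # claim text -> agreement label


def run_jury(question: str, agents: Dict[str, Model], chairman: Model) -> JuryResult:
    # Stage 1: role-specialized generation (one draft per role).
    drafts = {role: model(f"[{role}] {question}") for role, model in agents.items()}

    # Stage 2: anonymized cross-review. Drafts are relabeled A, B, C, ...
    # so that reviewers cannot tell which role wrote which draft.
    letters = [chr(ord("A") + i) for i in range(len(drafts))]
    packet = "\n".join(
        f"Response {letter}: {text}" for letter, text in zip(letters, drafts.values())
    )
    reviews = [
        model(f"Review these responses to: {question}\n{packet}")
        for model in agents.values()
    ]

    # Stage 3: chairman synthesis over the anonymized drafts and reviews.
    answer = chairman(
        f"Question: {question}\n{packet}\nReviews:\n" + "\n".join(reviews)
    )

    # Stage 4: claim-level agreement labeling.
    # Assumption: claims are naive sentence splits, and a draft "supports"
    # a claim if it contains the claim text (case-insensitive).
    claims = [s.strip() for s in answer.split(".") if s.strip()]
    n = len(drafts)
    labels: Dict[str, str] = {}
    for claim in claims:
        support = sum(claim.lower() in d.lower() for d in drafts.values())
        if support == n:
            labels[claim] = "consensus"
        elif support * 2 > n:
            labels[claim] = "majority"
        else:
            labels[claim] = "contested"
    return JuryResult(answer=answer, claim_labels=labels)
```

With stub callables standing in for the 3B–8B models, `run_jury("What is the capital of France?", agents, chairman)` returns a `JuryResult` whose `claim_labels` map each sentence of the synthesized answer to how many Stage 1 drafts agree with it. A real deployment would replace the substring match with a proper claim extraction and agreement check.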