What question did this study set out to answer?

The aim is to develop an ensemble framework for enhancing the truthfulness of question answering using small language models.

April 25, 2026 (Open Access)

EdgeJury: Cross-Reviewed Small-Model Ensembles for Truthful Question Answering on Serverless Edge Inference

Key Points

Developed a four-stage ensemble framework (EdgeJury) using small instruction-tuned language models (3B–8B).
Evaluated on TruthfulQA and on a 200-question adversarial EdgeCases set.
Conducted manual error analysis to compare factual hallucination errors against a single-model baseline.
EdgeJury achieved 76.2% accuracy (95% CI: 72.8–79.6%) on TruthfulQA (MC1), a 21.4% relative improvement over the single-model 8B baseline.
Achieved +48.2% relative gain on a 200-question adversarial EdgeCases set.
Manual analysis showed around a 55% reduction in factual hallucination errors compared to the single-model baseline.
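
The relative figures above can be unpacked with a little arithmetic. The sketch below assumes that "relative improvement" means (new - baseline) / baseline. The baseline accuracy it backs out is derived from the rounded numbers in this summary, not quoted from the paper, so treat it as approximate. The function names are illustrative.

```python
def implied_baseline(score: float, relative_gain: float) -> float:
    """Back out a baseline from a score and its relative gain over that baseline."""
    return score / (1.0 + relative_gain)


def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement of score over baseline, as a fraction."""
    return (score - baseline) / baseline


# 76.2% accuracy reported as a +21.4% relative improvement (figures from the summary).
baseline = implied_baseline(76.2, 0.214)
print(round(baseline, 1))          # implied single-model accuracy, in percent
print(round(76.2 - baseline, 1))   # implied absolute gain, in percentage points
```

Under that assumption the single-model baseline sits near 62.8%, so the absolute gain is about 13.4 percentage points, noticeably smaller than the 21.4% relative figure suggests at first glance.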

Abstract

Hallucinations hinder reliable question answering, especially in resource-constrained deployments where frontier-scale models or retrieval-heavy pipelines may be impractical. We present EdgeJury, a four-stage ensemble framework that improves truthfulness using only small instruction-tuned language models (3B–8B) suitable for serverless edge inference. EdgeJury combines role-specialized generation, anonymized cross-review, chairman synthesis, and claim-level agreement labeling. On TruthfulQA (MC1), EdgeJury reaches 76.2% accuracy (95% CI: 72.8–79.6%), a +21.4% relative improvement over a single 8B baseline, and outperforms self-consistency and majority voting under transparent compute accounting. On a 200-question adversarial EdgeCases set, it achieves +48.2% relative gains, while manual error analysis shows an approximately 55% reduction in factual hallucination errors versus the single-model baseline. Beyond end-to-end gains, we analyze why the framework works and when it is worth its cost. Stage 1 agents exhibit non-trivial complementarity, blind synthesis preserves most of the full system’s accuracy while reducing chairman self-preference, structured critique outperforms open-ended critique in orchestration reliability, and scaling/pruning experiments show that most of the gain is obtained by four agents with substantial latency recovery available through early exit. Deployed on Cloudflare Workers AI, EdgeJury achieves 8.4 s median end-to-end latency, showing that coordinated small-model ensembles can improve truthfulness on misconception-heavy QA benchmarks without external retrieval or proprietary large-model APIs.
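
The four stages named in the abstract can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: the `Model` callable interface, the prompts, the role names, the label names ("consensus", "majority", "disputed"), and the substring-based claim matching are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


@dataclass
class JuryResult:
    answer: str
    claim_labels: Dict[str, str]  # claim text -> agreement label


def label_claims(answer: str, drafts: List[str]) -> Dict[str, str]:
    """Stage 4 stand-in: treat each sentence as a claim and count supporting drafts."""
    labels: Dict[str, str] = {}
    for claim in (s.strip() for s in answer.split(".")):
        if not claim:
            continue
        # Crude assumption: a draft "supports" a claim if it contains it verbatim.
        support = sum(claim.lower() in d.lower() for d in drafts)
        if support == len(drafts):
            labels[claim] = "consensus"
        elif support * 2 > len(drafts):
            labels[claim] = "majority"
        else:
            labels[claim] = "disputed"
    return labels


def run_jury(question: str, agents: Dict[str, Model], chairman: Model) -> JuryResult:
    roles = list(agents)
    # Stage 1: each agent answers under its own role prompt.
    drafts = [agents[r](f"You are the {r}. Answer: {question}") for r in roles]
    # Stage 2: each agent critiques the other drafts, identified only by letter.
    tags = [f"Answer {chr(ord('A') + i)}" for i in range(len(drafts))]
    reviews = []
    for i, r in enumerate(roles):
        others = "\n".join(
            f"{t}: {d}" for j, (t, d) in enumerate(zip(tags, drafts)) if j != i
        )
        reviews.append(agents[r](f"Critique these answers to '{question}':\n{others}"))
    # Stage 3: the chairman sees anonymized drafts plus critiques and writes one answer.
    packet = "\n".join(f"{t}: {d}" for t, d in zip(tags, drafts))
    final = chairman(
        f"Question: {question}\n{packet}\nCritiques:\n" + "\n".join(reviews)
    )
    # Stage 4: label each claim in the final answer by agreement across drafts.
    return JuryResult(final, label_claims(final, drafts))


def stub(text: str) -> Model:
    """Test double that ignores the prompt and returns fixed text."""
    return lambda prompt: text


demo = run_jury(
    "What is the capital of France?",
    {
        "skeptic": stub("Paris is the capital of France."),
        "explainer": stub("Paris is the capital of France."),
        "contrarian": stub("Lyon is the capital of France."),
    },
    chairman=stub("Paris is the capital of France."),
)
print(demo.claim_labels)
```

In a real deployment the stubs would be replaced by calls to hosted small models, and the verbatim substring check would need a more robust notion of claim agreement. Labeling drafts by letter rather than by role is one simple way to keep reviewers and the chairman from favoring a known source.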

Cite This Study

Aayush Kumar studied this question.

synapsesocial.com/papers/69ec593e88ba6daa22dab2b7 https://doi.org/10.1109/access.2026.3683784
