Large language models (LLMs) often hallucinate, producing plausible but inaccurate responses, particularly when they misjudge their own confidence (arXiv:2401.01313). This paper introduces Topic-Aware Inference Boost, a modular microservice architecture designed to mitigate hallucinations through rapid, topic-specific inference augmentation. The system delivers just-in-time, expert-level responses from curated subject-matter-expert (SME) models through a lightweight API, without requiring retraining or prompt engineering. The prototype demonstrates end-to-end latency of 1 to 7 seconds on standard CPUs, with over 90% inference quality across multiple domain tasks. By decoupling topic specialization from monolithic LLMs, the architecture lets any client model improve its reliability through targeted grounding. Phase 2 will extend the framework so that models can self-evaluate their confidence and selectively invoke the service for low-confidence inferences, while maintaining real-time performance and high accuracy.

Note to Readers: This document, formerly titled "InferBoost," has been renamed Topic-Aware Inference Boost to improve technical clarity and to disambiguate the research from external websites currently using the "InferBoost" term. The underlying architecture, topic-identification methodology, and performance metrics remain unchanged.
Gitanjali GulveSehgal studied this question.
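The flow described in the abstract (identify a topic, route the query to an SME model, and, in the planned Phase 2, do so only for low-confidence inferences) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `identify_topic`, `boost`, `BoostResult`, the keyword table, and the 0.7 threshold are illustrative assumptions. In particular, the keyword-overlap matcher is a toy stand-in for the paper's topic-identification methodology, which the abstract does not specify.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set

# Hypothetical type: an SME model is any callable mapping a query to an answer.
SmeModel = Callable[[str], str]


@dataclass
class BoostResult:
    answer: str            # final answer returned to the client
    topic: Optional[str]   # topic detected for the query, if any
    boosted: bool          # True when an SME model supplied the answer


def identify_topic(query: str, keywords: Dict[str, Set[str]]) -> Optional[str]:
    """Toy topic identifier: pick the topic with the largest keyword overlap."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    best_topic, best_overlap = None, 0
    for topic, vocab in keywords.items():
        overlap = len(words & vocab)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic


def boost(
    query: str,
    draft_answer: str,
    confidence: float,
    keywords: Dict[str, Set[str]],
    sme_models: Dict[str, SmeModel],
    threshold: float = 0.7,  # assumed cut-off, not a value from the paper
) -> BoostResult:
    """Keep the client's draft when it is confident; otherwise consult an SME."""
    if confidence >= threshold:
        return BoostResult(draft_answer, None, False)
    topic = identify_topic(query, keywords)
    if topic is None or topic not in sme_models:
        # No matching expert is available: fall back to the draft answer.
        return BoostResult(draft_answer, topic, False)
    return BoostResult(sme_models[topic](query), topic, True)
```

In a deployment along the lines the abstract describes, each entry in `sme_models` would presumably wrap a call to the lightweight API rather than a local function; the gating logic would be unchanged.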