This paper presents the design and optimisation of a production-scale LLM serving architecture for educational tutoring systems deployed on commodity cloud hardware. The system runs on AWS EC2 g4dn instances equipped with NVIDIA T4 GPUs and uses vLLM for efficient inference. Starting from a single-node baseline, the architecture introduces dynamic batching, KV cache offloading, prefix-affinity routing, and 8-bit GPTQ quantisation. These optimisations reduce median time-to-first-token latency by 59% (480 ms → 195 ms) and increase concurrent session capacity from approximately 400 to about 3,200 on a four-node cluster. The evaluation uses Locust-based load testing and a benchmark of JAMB examination questions to assess both system performance and answer accuracy. The work demonstrates that production-scale LLM tutoring infrastructure can be deployed using widely available commodity GPU instances without specialised research hardware.
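As an illustration of the prefix-affinity routing named above, the following is a minimal sketch and not the paper's implementation. It assumes requests that share a prompt prefix (for example, a common system prompt) should be sent to the same node so that node's cached prefix can be reused. The node names, the 256-character prefix length, and the use of rendezvous hashing are all assumptions made for this example.

```python
import hashlib

# Hypothetical node names for a four-node cluster (not taken from the paper).
NODES = ["node-0", "node-1", "node-2", "node-3"]
# Assumed number of leading prompt characters used as the affinity key.
PREFIX_CHARS = 256


def _score(node: str, key: str) -> int:
    """Deterministic score for a (node, key) pair, derived from SHA-256."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def route(prompt: str, nodes=NODES, prefix_chars: int = PREFIX_CHARS) -> str:
    """Choose a node by rendezvous hashing on the prompt prefix.

    Prompts that share their first `prefix_chars` characters map to the
    same node, so a prefix cached on that node can be reused.
    """
    if not nodes:
        raise ValueError("no nodes available")
    key = prompt[:prefix_chars]
    return max(nodes, key=lambda node: _score(node, key))
```

With rendezvous hashing, removing a node only remaps the prefixes that were assigned to it; all other prefixes keep their node, which limits cache loss when the cluster changes size.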
Akshita Bhardwaj
Akshita Bhardwaj studied this question.
www.synapsesocial.com/papers/69ad1331e7e9681137aa8fc0 — DOI: https://doi.org/10.5281/zenodo.18894249