What question did this study set out to answer?

The study set out to design a low-latency architecture for serving large language models (LLMs) in educational tutoring systems using commodity cloud resources.

March 8, 2026 · Open Access

A Practical Architecture for Low-Latency LLM Inference in Educational Assistants

Key Points

Developed a production-scale LLM serving architecture on AWS EC2 g4dn instances with NVIDIA T4 GPUs.
Implemented optimizations including dynamic batching, KV cache offloading, prefix-affinity routing, and 8-bit GPTQ quantization.
Evaluated system performance with Locust-based load testing and answer accuracy with a benchmark of JAMB examination questions.
Reduced median time-to-first-token latency by 59%, lowering it from 480 ms to 195 ms.
Increased concurrent session capacity from approximately 400 to about 3,200 on a four-node cluster.
Demonstrated answer accuracy equivalent to traditional methods during load testing.
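
The headline figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in the key points. The per-node figure assumes sessions are spread evenly across the four nodes, which the source does not state.

```python
# Figures as reported in the key points; session counts are approximate.
baseline_ttft_ms = 480.0
optimised_ttft_ms = 195.0
baseline_sessions = 400
optimised_sessions = 3200
cluster_nodes = 4


def relative_reduction(before: float, after: float) -> float:
    """Return the fractional reduction going from `before` to `after`."""
    return (before - after) / before


ttft_reduction = relative_reduction(baseline_ttft_ms, optimised_ttft_ms)
capacity_gain = optimised_sessions / baseline_sessions

# Assumption: an even split across nodes (not stated in the source).
sessions_per_node = optimised_sessions / cluster_nodes

print(f"TTFT reduction: {ttft_reduction:.1%}")
print(f"Capacity gain: {capacity_gain:.0f}x")
print(f"Sessions per node (even split): {sessions_per_node:.0f}")
```

The reduction works out to roughly 59.4%, consistent with the reported 59%, and the session capacity grows about eightfold.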

Abstract

This paper presents the design and optimisation of a production-scale LLM serving architecture for educational tutoring systems deployed on commodity cloud hardware. The system runs on AWS EC2 g4dn instances equipped with NVIDIA T4 GPUs and uses vLLM for efficient inference. Starting from a single-node baseline, the architecture introduces dynamic batching, KV cache offloading, prefix-affinity routing, and 8-bit GPTQ quantisation. These optimisations reduce median time-to-first-token latency by 59% (480 ms → 195 ms) and increase concurrent session capacity from approximately 400 to about 3,200 on a four-node cluster. The evaluation uses Locust-based load testing and a benchmark of JAMB examination questions to assess both system performance and answer accuracy. The work demonstrates that production-scale LLM tutoring infrastructure can be deployed using widely available commodity GPU instances without specialised research hardware.
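
Of the optimisations listed, prefix-affinity routing is the easiest to illustrate in isolation. The abstract names the technique but does not describe its implementation, so what follows is only a minimal sketch of the general idea: requests that share a leading prompt segment (for example, a common tutoring system prompt) go to the same node, where any cached KV state for that prefix is more likely to be reused. The function name `route_by_prefix`, the node names, the 256-character prefix window, and the example prompt are all hypothetical.

```python
import hashlib


def route_by_prefix(prompt: str, nodes: list[str], prefix_chars: int = 256) -> str:
    """Choose a node from a hash of the prompt's leading characters.

    Prompts that share their first `prefix_chars` characters always map to
    the same node. This is an illustrative scheme, not the paper's method.
    """
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Hypothetical four-node cluster and a shared system prompt longer than
# the prefix window, so the user question does not affect the hash.
nodes = ["node-0", "node-1", "node-2", "node-3"]
system_prompt = "You are a tutor for JAMB exam preparation. " * 8

first = route_by_prefix(system_prompt + "Explain osmosis.", nodes)
second = route_by_prefix(system_prompt + "Solve 2x + 3 = 11.", nodes)

print(first == second)  # both requests share the prefix, so they co-locate
```

Plain modulo hashing remaps most prefixes whenever the node count changes. A production router would more likely use consistent or rendezvous hashing and take node load into account, but those refinements are beyond what the abstract specifies.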

Cite This Study

Akshita Bhardwaj studied this question.

synapsesocial.com/papers/69ad1331e7e9681137aa8fc0 https://doi.org/10.5281/zenodo.18894249
