Voice-based interaction is increasingly central to human–robot communication, yet most deployed voice assistants depend heavily on cloud services, which introduces latency, requires continuous network connectivity, and raises data-privacy concerns. These constraints are especially limiting for robots operating in laboratories, industrial sites, and remote environments. This paper proposes Wrapper, a local-first architecture for voice-based robot assistants that unifies local speech-to-text, retrieval-augmented generation (RAG) over a local knowledge base, on-device large language model (LLM) inference, and a cloud LLM that is invoked only as a last resort. The central design element is a confidence-based routing layer that escalates a query through three stages (RAG, then a local LLM, then a cloud LLM) using explicit relevance and confidence thresholds, so that cloud calls occur only when local options are exhausted. We describe the system architecture, the runtime inference pipeline, and the routing logic, and we present a proof-of-concept implementation built from open-source and freely available components (Whisper, sentence-transformer embeddings, a local vector store, an Ollama-hosted local LLM, and a cloud LLM API). The contribution of this work is architectural: a coherent, reproducible design for privacy-aware, low-dependency voice interaction in robotics. A systematic empirical evaluation of transcription accuracy, routing correctness, latency, and response quality is outlined as the immediate next step and is left to future work.
Hamim Fahmid Hossain (Sun,) studied this question.
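The three-stage escalation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route`, the `Answer` type, the callable signatures, and both threshold values are hypothetical, chosen only to show how a relevance threshold and a confidence threshold might gate the fall-through from RAG to a local LLM to a cloud LLM. The back ends are passed in as plain callables, so no particular speech, embedding, or LLM library is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical values: the abstract mentions explicit relevance and
# confidence thresholds but does not state them.
RAG_RELEVANCE_THRESHOLD = 0.75
LOCAL_CONFIDENCE_THRESHOLD = 0.60


@dataclass
class Answer:
    text: str
    source: str             # "rag", "local_llm", or "cloud_llm"
    score: Optional[float]  # relevance or confidence; None for the cloud stage


def route(
    query: str,
    retrieve: Callable[[str], Tuple[Optional[str], float]],
    local_llm: Callable[[str], Tuple[str, float]],
    cloud_llm: Callable[[str], str],
    rag_threshold: float = RAG_RELEVANCE_THRESHOLD,
    local_threshold: float = LOCAL_CONFIDENCE_THRESHOLD,
) -> Answer:
    """Escalate a transcribed query through RAG, local LLM, then cloud LLM."""
    # Stage 1: answer from the local knowledge base if retrieval is relevant enough.
    passage, relevance = retrieve(query)
    if passage is not None and relevance >= rag_threshold:
        return Answer(passage, "rag", relevance)

    # Stage 2: accept the on-device model's answer if it is confident enough.
    text, confidence = local_llm(query)
    if confidence >= local_threshold:
        return Answer(text, "local_llm", confidence)

    # Stage 3: last resort; the cloud is reached only when both local stages fail.
    return Answer(cloud_llm(query), "cloud_llm", None)
```

In this sketch the cloud callable is never invoked unless both local checks fail, which mirrors the stated goal that cloud calls occur only when local options are exhausted. How relevance and confidence are actually computed is left to the callables and is outside the scope of this illustration.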