June 2, 2026 · Open Access

Wrapper: A Local-First Architecture for Voice-Based Robot Assistants Using Speech Recognition, Retrieval-Augmented Generation, and Hierarchical LLM Fallback

Key Points

The study proposes a local-first architecture for voice-based robot assistants that reduces reliance on cloud services.
The Wrapper architecture integrates local speech-to-text, RAG over a local knowledge base, and on-device LLM inference.
A confidence-based routing layer escalates each query from RAG to a local LLM and, only as a last resort, to a cloud LLM.
A proof-of-concept implementation uses open-source and freely available components such as Whisper and a local vector store.
The proposed architecture aims to enhance transcription accuracy and reduce latency during human-robot interaction.
The system is designed to operate in remote environments without continuous internet access.
Future work will empirically evaluate transcription accuracy, routing correctness, latency, and response quality.

Abstract

Voice-based interaction is increasingly central to human–robot communication, yet most deployed voice assistants depend heavily on cloud services, which introduces latency, requires continuous network connectivity, and raises data-privacy concerns. These constraints are especially limiting for robots operating in laboratories, industrial sites, and remote environments. This paper proposes Wrapper, a local-first architecture for voice-based robot assistants that unifies local speech-to-text, retrieval-augmented generation (RAG) over a local knowledge base, on-device large language model (LLM) inference, and a cloud LLM that is invoked only as a last resort. The central design element is a confidence-based routing layer that escalates a query through three stages (RAG, then a local LLM, then a cloud LLM) using explicit relevance and confidence thresholds, so that cloud calls occur only when local options are exhausted. We describe the system architecture, the runtime inference pipeline, and the routing logic, and we present a proof-of-concept implementation built from open-source and freely available components (Whisper, sentence-transformer embeddings, a local vector store, an Ollama-hosted local LLM, and a cloud LLM API). The contribution of this work is architectural: a coherent, reproducible design for privacy-aware, low-dependency voice interaction in robotics. A systematic empirical evaluation of transcription accuracy, routing correctness, latency, and response quality is outlined as the immediate next step and is left to future work.
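The three-stage escalation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function `route`, the `Answer` type, the threshold values 0.75 and 0.60, and the three stub back ends are all hypothetical stand-ins chosen for illustration, since the abstract gives neither concrete thresholds nor interfaces. The sketch also assumes that each local stage returns a score in [0, 1] alongside its answer. In a real system the stubs would presumably be replaced by a vector-store lookup, an Ollama-hosted model, and a cloud LLM API call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Hypothetical thresholds; the source does not state actual values.
RAG_RELEVANCE_THRESHOLD = 0.75
LOCAL_CONFIDENCE_THRESHOLD = 0.60


@dataclass
class Answer:
    text: str
    source: str             # "rag", "local_llm", or "cloud_llm"
    score: Optional[float]  # relevance or confidence; None for the cloud stage


# A local stage maps a query to (answer text, score in [0, 1]).
ScoredStage = Callable[[str], Tuple[str, float]]


def route(
    query: str,
    rag: ScoredStage,
    local_llm: ScoredStage,
    cloud_llm: Callable[[str], str],
    rag_threshold: float = RAG_RELEVANCE_THRESHOLD,
    local_threshold: float = LOCAL_CONFIDENCE_THRESHOLD,
) -> Answer:
    """Escalate a query through RAG, then a local LLM, then a cloud LLM."""
    # Stage 1: accept the RAG answer only if retrieval is relevant enough.
    text, relevance = rag(query)
    if relevance >= rag_threshold:
        return Answer(text, "rag", relevance)

    # Stage 2: accept the local LLM answer only if it is confident enough.
    text, confidence = local_llm(query)
    if confidence >= local_threshold:
        return Answer(text, "local_llm", confidence)

    # Stage 3: local options are exhausted, so call the cloud LLM.
    return Answer(cloud_llm(query), "cloud_llm", None)


# --- Stub back ends (assumptions, for demonstration only) ---

def fake_rag(query: str) -> Tuple[str, float]:
    kb = {"where is the charging dock": ("The charging dock is in bay 3.", 0.92)}
    return kb.get(query.lower().rstrip("?"), ("", 0.10))


def fake_local_llm(query: str) -> Tuple[str, float]:
    if "hello" in query.lower():
        return ("Hello! How can I help?", 0.85)
    return ("I am not sure.", 0.20)


def fake_cloud_llm(query: str) -> str:
    return f"[cloud answer to: {query}]"


if __name__ == "__main__":
    for q in ["Where is the charging dock?", "Hello robot", "Explain quantum tunnelling"]:
        answer = route(q, fake_rag, fake_local_llm, fake_cloud_llm)
        print(f"{q!r} -> {answer.source}: {answer.text}")
```

Passing the stages in as callables keeps the routing logic independent of any particular speech, retrieval, or model library, so each stage can be swapped or mocked without touching the thresholds.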
