Abstract
Smartphones have become ubiquitous central hubs for daily interaction, yet complex mobile task execution remains heavily dependent on manual input, creating critical bottlenecks in intelligence, efficiency, and robustness, especially in hands-free scenarios like driving. To address this, we propose SVI-MMAgent (Speech-Visual Interaction Multimodal Multi-Agent Framework), a novel mobile automation system that fully integrates speech-visual interaction with a VLLM-based multi-agent architecture to enable truly hands-free, robust, and efficient task execution. SVI-MMAgent unifies Speech-to-Text (STT) for voice command input, Text-to-Speech (TTS) for verbal feedback, and a suite of specialized agents (including Planner, Operator, Reflector, Ending Judger, Record Judger, Dynamic Recorder, and a Lifelong Knowledge Pool) operating under a plan–execute–reflect paradigm. The framework delivers four key innovations: (1) an expanded action space with Scroll, Consecutive Tap, and Backspace to handle fine-grained UI interactions; (2) enhanced robustness and efficiency through an Ending Judger that prevents catastrophic termination errors, a dynamic short-term memory (controlled by Record Judger and Dynamic Recorder) that activates only when needed, and a Reflector augmented with prompt engineering to interpret subtle visual changes; (3) a bidirectional closed-loop voice verification mechanism: when execution falters, the system vocalizes its state via TTS, accepts user corrections through natural speech (via STT), and resumes only after verbal confirmation, enabling real-time, hands-free error recovery; and (4) a Lifelong Knowledge Pool that accumulates reusable task templates and user-specific habits, supporting concise command understanding and rapid workflow instantiation. Evaluated on the Mobile-Eval-E benchmark, SVI-MMAgent achieves state-of-the-art performance across three VLLM backbones: with GPT-4o, it attains a 95.4% Satisfaction Score (+8.5% over prior SOTA), 96.3% Action Accuracy, 99.5% Reflection Accuracy, and reduces Termination Error to just 2.0%, a 6× improvement. Consistent gains on Gemini-1.5-pro and Claude-3.5-Sonnet confirm that our architectural advances are orthogonal to the underlying model. These results demonstrate that SVI-MMAgent effectively realizes hands-free mobile automation in real-world scenarios, offering a robust, versatile, and interactive foundation for next-generation intelligent assistants.
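As a rough illustration of the plan–execute–reflect paradigm with voice-based recovery described in the abstract, the control flow might be sketched as below. This is not the authors' implementation. The function names (`run_task`, `planner`, `operator`, `reflector`, `ending_judger`, `ask_user`), their signatures, and the `LoopResult` type are all hypothetical. The agents are plain callables rather than VLLM calls, `ask_user` stands in for the TTS/STT round trip, and feeding the user's correction back to the planner as a hint is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LoopResult:
    finished: bool
    steps: List[str] = field(default_factory=list)  # actions confirmed by the reflector
    asked_user: int = 0                              # number of voice-recovery round trips


def run_task(
    command: str,
    planner: Callable[[str, List[str]], str],    # hypothetical: picks the next action
    operator: Callable[[str], str],              # hypothetical: executes it, returns an observation
    reflector: Callable[[str, str], bool],       # hypothetical: did the action take effect?
    ending_judger: Callable[[List[str]], bool],  # hypothetical: is the task really complete?
    ask_user: Callable[[str], str],              # hypothetical stand-in for TTS out / STT in
    max_steps: int = 10,
) -> LoopResult:
    result = LoopResult(finished=False)
    hint = command
    for _ in range(max_steps):
        action = planner(hint, result.steps)
        observation = operator(action)
        if not reflector(action, observation):
            # Execution faltered: report state and wait for a spoken correction.
            result.asked_user += 1
            hint = ask_user(f"Action {action!r} did not take effect.")
            continue
        result.steps.append(action)
        hint = command
        # Stop only when the ending judger agrees the task is complete.
        if ending_judger(result.steps):
            result.finished = True
            break
    return result


def demo() -> LoopResult:
    script = ["Tap search box", "Type query", "Scroll"]
    failed_once = set()

    def planner(hint, done):
        return script[len(done)]

    def operator(action):
        # Fail the first "Scroll" attempt to exercise the recovery path.
        if action == "Scroll" and action not in failed_once:
            failed_once.add(action)
            return "screen unchanged"
        return "screen changed"

    def reflector(action, observation):
        return observation == "screen changed"

    def ending_judger(done):
        return len(done) == len(script)

    def ask_user(message):
        return "try again"

    return run_task("search for something", planner, operator,
                    reflector, ending_judger, ask_user)
```

In this toy run the first `Scroll` attempt is rejected by the reflector, the loop asks the user once, and the retried action then completes the task. Keeping the termination check in a separate callable mirrors the abstract's point that ending decisions are made by a dedicated Ending Judger and not by the planner.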
Wang et al. (Sat,) studied this question.