What question did this study set out to answer?

This research aims to enhance robotic assistance by improving intention prediction and decision-making in human-robot collaboration.

June 17, 2026 · Open Access

Modular Framework for Responsive and Explainable Robotic Assistance with Intention Prediction Using Human-Centric Digital Twins

Key Points

The authors developed the Agentic Unified Robotic Assistance (AURA) Framework, which integrates Large Language Model reasoning with modular monitors.
They introduced a human-in-the-loop teleoperation data collection methodology and an offline evaluation scheme with an Appropriateness Score (A-Score).
Released a benchmark dataset of multimodal HRC episodes including workspace and robot camera videos.
Combined F1 score increased by over 20 points between context-poor and context-rich conditions across three tasks.
Lightweight multimodal models achieved performance on par with heavier models at roughly one-fifth the inference latency.
The structured grounding improved intent prediction and decision-making progressively with richer contextual data.

Abstract

Proactive robotic assistance in human–robot collaboration (HRC) requires systems that can perceive evolving task contexts, anticipate user needs, and intervene appropriately without disrupting human workflow. We present the Agentic Unified Robotic Assistance (AURA) Framework, which couples Large Language Model (LLM) reasoning grounded by Standard Operating Procedures (SOPs) with a modular layer of specialized Intent, Motion, Perception, Sound, Affordance, and Performance Monitors that supply structured context to a central decision-making module, making the framework reconfigurable and auditable without retraining or re-prompting. We introduce a human-in-the-loop teleoperation data collection methodology and an offline evaluation scheme with an Appropriateness Score (A-Score) tailored to proactive intervention timing, and release a benchmark dataset of annotated multimodal HRC episodes containing workspace and robot wrist camera videos, robot joint states, and labeled intervention events. Across three tasks of varying complexity, we observe progressive gains in intent prediction and decision-making as the modules are supplied with richer grounded context (prior-state memory and tracked object locations), with Combined F1 rising by over 20 points between context-poor and context-rich conditions. The structured grounding allows lightweight multimodal backbones such as Gemini 3.1 Flash Lite to perform on par with heavier reasoning-tier models at roughly one-fifth the inference latency. Together, these contributions establish a scalable framework, benchmark, and evaluation methodology for advancing proactive robotic assistance in collaborative environments.
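
The monitor-to-decision flow described in the abstract can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the paper's implementation: the names `intent_monitor`, `perception_monitor`, `build_context`, and `decide` are hypothetical, the observation fields are invented, and a plain SOP lookup table stands in for the LLM reasoning step that AURA actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types; none of these names come from the paper.
Observation = Dict[str, object]
Monitor = Callable[[Observation], Dict[str, object]]


@dataclass
class Decision:
    action: str
    rationale: List[str]  # kept so that each decision can be audited


def intent_monitor(obs: Observation) -> Dict[str, object]:
    # Toy rule standing in for a learned intent predictor (assumption).
    reaching = obs.get("hand_target") == "screwdriver"
    return {"predicted_intent": "needs_screwdriver" if reaching else "none"}


def perception_monitor(obs: Observation) -> Dict[str, object]:
    # Passes tracked object locations through as structured context.
    return {"tracked_objects": obs.get("objects", {})}


def build_context(monitors: Dict[str, Monitor], obs: Observation) -> Dict[str, dict]:
    # Each monitor contributes one named entry; adding or removing a monitor
    # changes the context without touching the decision function.
    return {name: monitor(obs) for name, monitor in monitors.items()}


def decide(context: Dict[str, dict], sop: Dict[str, str]) -> Decision:
    # Stand-in for the central decision module: the SOP maps an intent to an
    # action, and the default is to wait rather than interrupt the human.
    intent = context.get("intent", {}).get("predicted_intent", "none")
    action = sop.get(intent, "wait")
    return Decision(action, [f"intent={intent}", f"sop_action={action}"])


monitors = {"intent": intent_monitor, "perception": perception_monitor}
sop = {"needs_screwdriver": "hand_over_screwdriver"}
obs = {"hand_target": "screwdriver", "objects": {"screwdriver": (0.4, 0.1)}}

decision = decide(build_context(monitors, obs), sop)
print(decision.action, decision.rationale)
```

In this sketch, removing the `"intent"` entry from `monitors` makes `decide` fall back to `"wait"`, which loosely mirrors the abstract's point that the monitor layer can be reconfigured without changing the decision module.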
