Vision Language Models (VLMs) offer transformative potential for adaptive robot decision-making, but their practical deployment is severely hindered by a mismatch between their high computational latency and the strict real-time requirements of robotic control. Conventional VLMs typically run inference at low frequencies (e.g., 1–10 Hz), well below the high control frequencies (e.g., 50–200 Hz) needed for smooth, dynamic manipulation. The resulting control gaps manifest as robot jitter and actuation lag, which ultimately lead to task failures in unstructured environments. Inspired by the human motion process, this paper introduces a novel lightweight multimodal model built on VLMs, the Key Point Robotics Transformer (RT-K). By performing end-to-end inference on the critical task points identified in the task process, the proposed model significantly reduces computational demands and achieves high-speed training and inference on consumer-grade GPUs. This approach enables robots to perform tasks smoothly and reliably while lowering the barriers to training and deploying VLMs in robotics applications. Experimental results show that the model achieves a root-mean-square error (RMSE) of 1.11° for joint control and a success rate of 92% on language grounding and motion generalization tasks, while reducing the number of inferences by about 96% and the total inference time by about 98.7%.
Li et al. (Fri,) studied this question.
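The key-point idea in the abstract can be illustrated with a minimal sketch. It is not the RT-K implementation: the function names, the linear interpolation between key points, and the joint values are all assumptions made for illustration. The segment length of 25 steps is chosen only because a 96% reduction in inferences corresponds to roughly one inference per 25 control steps; the paper's actual step counts are not given in the abstract.

```python
import math


def interpolate_joints(start, goal, steps):
    """Yield `steps` joint-angle vectors moving linearly from start to goal.

    Linear interpolation is an assumption of this sketch, not the paper's method.
    """
    for i in range(1, steps + 1):
        t = i / steps
        yield [s + t * (g - s) for s, g in zip(start, goal)]


def run_keypoint_control(key_points, steps_per_segment):
    """Return the dense joint trajectory and the number of model inferences.

    One hypothetical model call is counted per key point reached; the steps
    in between are filled in without calling the model.
    """
    trajectory = [list(key_points[0])]
    inferences = 0
    for start, goal in zip(key_points, key_points[1:]):
        inferences += 1  # stands in for one end-to-end inference at a key point
        trajectory.extend(interpolate_joints(start, goal, steps_per_segment))
    return trajectory, inferences


def rmse(predicted, reference):
    """Root-mean-square error between two equal-length sequences of angles."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Illustrative joint angles in degrees for a two-joint arm (made-up values).
key_points = [[0.0, 0.0], [30.0, 10.0], [30.0, 45.0]]
trajectory, inferences = run_keypoint_control(key_points, steps_per_segment=25)

# A per-step baseline would call the model once for every control step.
dense_steps = len(trajectory) - 1
reduction = 1 - inferences / dense_steps
print(f"{inferences} inferences instead of {dense_steps}: {reduction:.0%} fewer")
```

The `rmse` helper shows how a joint-control error such as the reported 1.11° would be computed from predicted and reference joint angles; no values from the paper are reproduced here.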