Recent advances in language and vision models are reshaping the way humans interact with autonomous systems. This paper presents an intelligent framework for cognitive human-machine collaboration in indoor mobility applications. The system interprets spoken or written natural language commands and executes corresponding aerial missions by integrating speech recognition, large language models, and vision-based reasoning. This enables the drone to understand human intent, analyze its environment, and perform context-aware actions such as identifying individuals, inspecting sensitive information, or auditing workstation screens. A web-based interface facilitates real-time interaction and feedback. The framework was deployed in three real-world indoor scenarios using a lightweight drone platform, demonstrating the feasibility and flexibility of the proposed pipeline. While no quantitative benchmarks were applied, the study reports observed performance across the scenarios and highlights key limitations, including sensitivity to lighting conditions, ambient noise, and battery-related compute constraints. These findings support the promise of multimodal systems for collaborative aerial tasks and identify future opportunities for quantitative evaluation and robustness improvements.
Alsufaian et al. studied this question.