Visually impaired users face significant challenges in navigating complex indoor environments due to limited spatial awareness and a lack of real-time semantic guidance. This paper proposes a multimodal navigation system that integrates environmental perception with vision-language models (VLMs) to provide context-aware, explainable guidance without requiring additional infrastructure. The system combines RTAB-Map for localization, YOLO-World for open-vocabulary object detection, and a lightweight language model for semantic reasoning and natural language interaction. We evaluate the system on the RePOPE benchmark to assess hallucination in vision-language understanding, and in real-world indoor navigation experiments. The results show that integrating perception with language-based reasoning improves precision by up to 2.29% and consistently enhances F1-score compared to baseline VLM approaches. The real-world experiments further demonstrate reliable navigation performance, including multi-floor path planning and obstacle-aware guidance. The proposed system thus enhances spatial understanding and reduces hallucination, providing a practical and scalable solution for assistive navigation.
Lin et al. studied this question.