This paper presents the design and implementation of a real-time Visual AI Agent on the NVIDIA Jetson Orin Nano. The system integrates YOLOv8 for object detection, BLIP for image captioning, and Places365 for contextual scene recognition, forming a pipeline that not only detects objects in video streams but also describes their context in natural language. The initial design relied on GPT-4V for rich scene understanding; we subsequently optimized the system for fully offline, GPU-accelerated inference with ONNX models. Our experiments demonstrate real-time object detection, rapid contextual captioning, and accurate scene labelling, validating the Jetson Orin Nano as an effective edge AI platform. The proposed system is applicable to smart surveillance environments, assistive navigation tools, and privacy-preserving embedded vision systems.
Maestre et al. (Mon,) studied this question.
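
As an illustration of how the three stages named in the abstract might be combined per frame, the following is a minimal sketch and not the authors' implementation. The names `Detection`, `FrameDescription`, and `describe_frame`, the stub stage functions, and the 0.5 confidence threshold are all assumptions introduced here; in a deployed system each stage callable would presumably wrap an ONNX inference session for YOLOv8, BLIP, or Places365.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Detection:
    label: str
    confidence: float


@dataclass
class FrameDescription:
    detections: List[Detection]
    caption: str
    scene: str

    def summary(self) -> str:
        # Merge the three model outputs into one natural-language line.
        counts = Counter(d.label for d in self.detections)
        objects = ", ".join(
            f"{n} x {label}" for label, n in sorted(counts.items())
        ) or "no objects"
        return f"Scene: {self.scene}. Objects: {objects}. Caption: {self.caption}"


def describe_frame(
    frame: Any,
    detect: Callable[[Any], List[Detection]],
    caption: Callable[[Any], str],
    classify_scene: Callable[[Any], str],
    min_confidence: float = 0.5,  # assumed threshold, not taken from the paper
) -> FrameDescription:
    # Run each stage on the same frame and drop low-confidence detections.
    detections = [d for d in detect(frame) if d.confidence >= min_confidence]
    return FrameDescription(detections, caption(frame), classify_scene(frame))


# Hypothetical stand-ins for the detector, captioner, and scene classifier.
def fake_detect(frame: Any) -> List[Detection]:
    return [
        Detection("person", 0.91),
        Detection("person", 0.78),
        Detection("dog", 0.32),
    ]


def fake_caption(frame: Any) -> str:
    return "two people walking down a corridor"


def fake_scene(frame: Any) -> str:
    return "corridor"


result = describe_frame(None, fake_detect, fake_caption, fake_scene)
print(result.summary())
```

Passing the stages in as plain callables keeps the combination logic independent of any particular runtime, so a cloud-backed stage and an on-device stage could be swapped without touching `describe_frame`.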