Abstract Multimodal systems offer strong prospects for solving complex problems that conventional unimodal approaches cannot handle. This paper therefore proposes a multimodal video and radar system as a basis for multimodal machine-learning problems on edge devices. The system's architecture uses Docker containers to encapsulate its models and processes, which makes the system easy to manage. Object detection plays a central role in the proposed system, since identifying and localizing objects across data modalities is a critical component of many multimodal machine-learning tasks. The paper gives an overall description of the architecture and discusses the system's data pipeline. It addresses the challenge of data alignment by applying homographic transformations between video camera and radar data, and by calibrating the system so that the aligned data can be fused and used for prediction. It highlights the advantages of multimodal systems in complex and dynamic environments and provides a general approach to multimodal machine-learning problems on edge devices.
Ferraz et al. studied this question.
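The alignment step mentioned in the abstract can be illustrated with a minimal sketch of how a 3x3 homography maps radar ground-plane coordinates to image pixel coordinates. This is not the authors' implementation: the function name `apply_homography`, the matrix `H_RADAR_TO_IMAGE`, and all numeric values are hypothetical, and in practice the matrix would come from the system calibration. The example matrix is a simple scale-and-shift, which is a special case of a homography.

```python
from typing import List, Sequence, Tuple

Matrix3 = Sequence[Sequence[float]]
Point = Tuple[float, float]


def apply_homography(H: Matrix3, points: Sequence[Point]) -> List[Point]:
    """Map 2D points through a 3x3 homography using homogeneous coordinates."""
    mapped = []
    for x, y in points:
        # Lift (x, y) to (x, y, 1) and multiply by H.
        u = H[0][0] * x + H[0][1] * y + H[0][2]
        v = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if abs(w) < 1e-12:
            raise ValueError("point maps to infinity under this homography")
        # Divide by w to return to ordinary 2D coordinates.
        mapped.append((u / w, v / w))
    return mapped


# Hypothetical calibration result (assumed values, not from the paper):
# 20 pixels per metre, image y axis flipped, origin shifted to pixel (320, 480).
H_RADAR_TO_IMAGE = [
    [20.0, 0.0, 320.0],
    [0.0, -20.0, 480.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical radar detections as (lateral, range) positions in metres.
radar_detections = [(0.0, 0.0), (2.0, 5.0), (-3.0, 10.0)]
pixel_positions = apply_homography(H_RADAR_TO_IMAGE, radar_detections)
```

Once radar detections are expressed in pixel coordinates, they can be compared with bounding boxes from the video detector, which is one plausible way to associate the two modalities before fusion.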