Multi-modal Auto-Encoders as Joint Estimators for Robotics Scene Understanding

Key Points

Key points are not available for this paper at this time.

Abstract

We explore the capabilities of Auto-Encoders to fuse the information available from cameras and depth sensors, and to reconstruct missing data, for scene understanding tasks. In particular we consider three input modalities: RGB images; depth images; and semantic label information. We seek to generate complete scene segmentations and depth maps, given images and partial and/or noisy depth and semantic data. We formulate this objective of reconstructing one or more types of scene data using a Multi-modal stacked Auto-Encoder. We show that suitably designed Multi-modal Auto-Encoders can solve the depth estimation and the semantic segmentation problems simultaneously, in the partial or even complete absence of some of the input modalities. We demonstrate our method using the outdoor dataset KITTI that includes LIDAR and stereo cameras. Our results show that as a means to estimate depth from a single image, our method is comparable to the state-of-the-art, and can run in real time (i.e., less than 40ms per frame). But we also show that our method has a significant advantage over other methods in that it can seamlessly use additional data that may be available, such as a sparse point-cloud and/or incomplete coarse semantic labels.

Mark Helpful

Bookmark

Relay

View Full Paper