Recent work has shown that CNN-based depth and ego-motion estimators can be learned using unlabelled monocular videos. However, the performance is limited by unidentified moving objects that violate the underlying static scene assumption in geometric image reconstruction. More significantly, due to the lack of proper constraints, networks output scale-inconsistent results over different samples, i.e., the ego-motion network cannot provide full camera trajectories over a long video sequence because of the per-frame scale ambiguity. This paper tackles these challenges by proposing a geometry consistency loss for scale-consistent predictions and an induced self-discovered mask for handling moving objects and occlusions. Since we do not leverage multi-task learning like recent works, our framework is much simpler and more efficient. Comprehensive evaluation results demonstrate that our depth estimator achieves state-of-the-art performance on the KITTI dataset. Moreover, we show that our ego-motion network is able to predict a scale-consistent camera trajectory for long video sequences, and the resulting visual odometry accuracy is competitive with the recent model that is trained using stereo videos. To the best of our knowledge, this is the first work to show that deep networks trained using unlabelled monocular videos can predict globally scale-consistent camera trajectories over a long video sequence.
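The abstract names a geometry loss and an induced mask but does not state their formulas. As a rough illustration only, the sketch below assumes one plausible form: a per-pixel normalized absolute difference between two depth estimates of the same view (one warped in from a neighboring frame, one predicted for that view directly), with the loss as its mean and the mask as its complement. The function names and the flat-list representation of depth maps are hypothetical and are not taken from the paper.

```python
# Minimal sketch under stated assumptions; not the authors' implementation.
# Depth maps are represented as flat lists of positive floats (hypothetical choice).

def depth_inconsistency(warped_depth, interp_depth):
    """Per-pixel normalized difference between two depth estimates of one view.

    Assumed form: |a - b| / (a + b), which lies in [0, 1) for positive depths
    and is unchanged when both maps are multiplied by the same scale factor.
    """
    return [abs(a - b) / (a + b) for a, b in zip(warped_depth, interp_depth)]


def geometry_consistency_loss(warped_depth, interp_depth):
    """Mean inconsistency over all pixels (assumed form of the loss)."""
    diff = depth_inconsistency(warped_depth, interp_depth)
    return sum(diff) / len(diff)


def induced_mask(warped_depth, interp_depth):
    """Weight in (0, 1]: low where the two depths disagree, e.g. on moving
    objects or occlusions; derived from the inconsistency with no extra network."""
    return [1.0 - d for d in depth_inconsistency(warped_depth, interp_depth)]


# Two pixels agree; the third disagrees (as a moving object might).
warped = [2.0, 4.0, 10.0]
interp = [2.0, 4.0, 30.0]

diff = depth_inconsistency(warped, interp)
loss = geometry_consistency_loss(warped, interp)
mask = induced_mask(warped, interp)
```

In this toy case the third pixel receives a mask weight of 0.5 while the consistent pixels keep a weight of 1.0. Because the assumed difference is normalized by the sum of the two depths, the loss only vanishes when adjacent predictions agree in scale, which is the intuition behind using such a term to encourage scale-consistent predictions.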
Bian et al. (Wed,) studied this question.