This paper gives a summary of automatic passive reconstruction of 3D scenes from images and video. Features are first tracked between the images, and relative camera poses are then estimated from the feature tracks. Both the feature tracking and the robust method used to estimate the camera poses have recently been shown to provide robust real-time performance. Given the camera poses, a more elaborate method is used to derive dense textured surfaces. The dense reconstruction method was originally part of the author's thesis, D. Nistér (2001). It first computes depth maps using graph cuts to optimize a Bayesian formulation: the graph cut operation lets depth-map hypotheses from various sources compete in order to optimize the cost function. The hypothesis generators used are feature-based surface triangulation, plane fitting, and multi-hypothesis, multi-scale patch-based stereo. The requirements a cost function must meet to fit into the graph cut framework were already given in D. Nistér (2001), along with a procedure for constructing the graph weights corresponding to any cost function that satisfies those requirements. Depth maps from many different viewpoints are then robustly fused into a consistent global result using a process called median fusion. The robustness of median fusion is important. It departs from the all-too-common practice of fusing results either by taking the union of all free-space constraints as the resulting free space, or by taking the union of all predicted surfaces as the resulting surfaces. Both approaches are tremendously error-prone, since they essentially correspond to taking either the maximum or the minimum of the individual results.
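A minimal sketch of why the choice of fusion rule matters, assuming the depth maps have already been resampled into one shared reference view so that each pixel holds at most one depth estimate per view. This is a per-pixel simplification, not the paper's median fusion procedure, which reasons about occlusions and free-space violations across different viewpoints; the function name `fuse` and the list-of-rows layout are hypothetical. Taking the farthest depth per pixel mimics the union of free space, taking the nearest depth mimics the union of predicted surfaces, and the median keeps a single gross outlier in either direction from dominating the result.

```python
from statistics import median
from typing import List, Optional

# Hypothetical layout: a depth map is a list of rows, already resampled into a
# shared reference view; None marks a pixel where that view has no estimate.
DepthMap = List[List[Optional[float]]]


def fuse(depth_maps: List[DepthMap], rule: str = "median") -> DepthMap:
    """Fuse aligned depth maps pixel by pixel (simplified illustration).

    rule="max"    -> farthest depth wins (mimics the union of free space)
    rule="min"    -> nearest depth wins (mimics the union of predicted surfaces)
    rule="median" -> per-pixel median (a single outlier cannot dominate)
    """
    pick = {"min": min, "max": max, "median": median}[rule]
    rows, cols = len(depth_maps[0]), len(depth_maps[0][0])
    fused: DepthMap = []
    for r in range(rows):
        fused_row: List[Optional[float]] = []
        for c in range(cols):
            # Collect every valid estimate for this pixel across the views.
            samples = [m[r][c] for m in depth_maps if m[r][c] is not None]
            fused_row.append(pick(samples) if samples else None)
        fused.append(fused_row)
    return fused


if __name__ == "__main__":
    # Three views agree on a surface at depth ~2.0, but one view has a
    # spurious near estimate (0.3) and another a spurious far one (9.0).
    maps = [
        [[2.0, 2.0, 2.0]],
        [[2.1, 2.0, 0.3]],
        [[1.9, 9.0, 2.1]],
    ]
    for rule in ("min", "max", "median"):
        print(rule, fuse(maps, rule))
```

In this toy input, the minimum rule keeps the spurious 0.3 and the maximum rule keeps the spurious 9.0, while the median returns 2.0 at every pixel. That is the failure mode the abstract attributes to union-based fusion, shown here on a single ray per pixel.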
D. Nistér studied this question.