Existing stereo visual-inertial initialization methods, whether tightly or loosely coupled, rely critically on intermediate variables such as feature correspondences and camera poses rather than on the original image data. Computing these variables through feature tracking and Structure-from-Motion (SfM) inherently introduces errors that degrade the initialization results. To overcome this limitation, we propose a direct initialization method for stereo visual-inertial odometry that links the original image intensities directly to the initial parameters, bypassing the conventional intermediate variables. Specifically, we introduce a prediction function that computes corresponding points from the initial parameters. We then formulate an objective function that optimizes the initial parameters by minimizing the photometric error of sparse points, eliminating the need for feature tracking and SfM. The metric scale in our initialization is determined directly by the stereo baseline. We further propose an approximation method for two-frame initialization and demonstrate its efficacy even with minimal frame data. Extensive experiments confirm that our method achieves superior estimation accuracy and initialization success rate with a shorter runtime. Even with 3 frames for initialization, our method outperforms state-of-the-art methods that use 10 frames in most metrics.
Qiu et al. (Thu,) studied this question.
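
To make the idea of a photometric objective over sparse points concrete, the following is a minimal sketch and not the formulation used in the paper. It assumes a pinhole camera with intrinsics `K`, a known depth per point, and a rigid transform `(R, t)` standing in for whatever the initial parameters predict; the function names `predict_pixel`, `bilinear`, and `photometric_error` are hypothetical. The example only illustrates how a predicted correspondence turns into an intensity residual without any feature tracking.

```python
import numpy as np

def predict_pixel(u, depth, K, R, t):
    """Map pixel u = (x, y) with known depth from a source camera to a target camera.

    Assumption: (R, t) transforms source-frame points into the target frame.
    """
    p = depth * (np.linalg.inv(K) @ np.array([u[0], u[1], 1.0]))  # back-project to 3D
    q = K @ (R @ p + t)                                           # transform, then project
    return q[:2] / q[2]

def bilinear(img, x, y):
    """Bilinearly interpolate img at a sub-pixel location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
            + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])

def photometric_error(img_src, img_tgt, pixels, depths, K, R, t):
    """Sum of squared intensity differences over sparse integer-valued source pixels."""
    h, w = img_tgt.shape
    total = 0.0
    for u, d in zip(pixels, depths):
        x, y = predict_pixel(u, d, K, R, t)
        # Skip predictions that fall outside the interpolable region.
        if 0 <= x < w - 1 and 0 <= y < h - 1:
            r = bilinear(img_tgt, x, y) - img_src[int(u[1]), int(u[0])]
            total += r * r
    return total
```

For a rectified stereo pair with baseline `b`, passing `R = np.eye(3)` and `t = np.array([-b, 0.0, 0.0])` makes the predicted point shift horizontally by the disparity `f * b / depth`, which is how a known baseline ties the depths to metric units in this sketch. An optimizer would then adjust the unknown parameters to reduce `photometric_error`.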