Camera-based 3D occupancy prediction commonly relies on bird’s-eye-view (BEV) representations, yet two limitations remain: optimization instability when inserting new modules into pre-trained BEV encoders, and height-agnostic BEV-to-voxel lifting that fails to preserve elevation-aware scene structure. We propose GSH-Occ (Gradient-Shielded and Height-Aware BEV Occupancy Network), a framework that tackles both issues through two complementary mechanisms. Gradient-Shielded Residual Dual Attention (GS-RDA) introduces a zero-initialized residual gate that preserves the identity mapping at initialization, allowing new attention modules to be grafted onto pre-trained encoders without disturbing learned features. Height-Aware Adaptive Lift (HAL) replaces naive channel replication with per-voxel adaptive fusion of BEV features and learnable height embeddings, followed by 3D convolutional refinement to capture vertical structure. On the Occ3D-nuScenes validation benchmark, GSH-Occ achieves 46.92 mIoU, outperforming FlashOcc by 3.40 mIoU. Ablation studies confirm that GS-RDA and HAL target distinct failure modes and yield complementary improvements.
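The two mechanisms can be illustrated with a minimal pure-Python sketch. It is not the GSH-Occ implementation: the scalar gate, the linear-plus-sigmoid fusion weight, and every name in it (`gated_residual`, `height_aware_lift`, `replicate_lift`, `fuse_w`, `fuse_b`) are assumptions made for illustration, and the 3D convolutional refinement is omitted. It shows only that a residual branch scaled by a zero-initialized gate leaves the input unchanged, and that fusing a BEV feature with per-height embeddings gives voxels that differ along the vertical axis, whereas plain replication does not.

```python
import math


def gated_residual(x, branch, gate=0.0):
    """Return x + gate * branch(x), elementwise over a flat feature list.

    With gate == 0.0 (its assumed initial value) the output equals x,
    whatever the new branch computes.
    """
    update = branch(x)
    return [xi + gate * ui for xi, ui in zip(x, update)]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def replicate_lift(bev, num_heights):
    """Height-agnostic baseline: copy the same BEV feature to every height."""
    return [list(bev) for _ in range(num_heights)]


def height_aware_lift(bev, height_emb, fuse_w, fuse_b=0.0):
    """Lift one BEV cell (C channels) to Z voxels with adaptive fusion.

    bev:        list of C floats for a single BEV cell.
    height_emb: Z lists of C floats (learnable in a real model).
    fuse_w:     2C weights of a hypothetical linear gate over [bev, emb].
    """
    voxels = []
    for emb in height_emb:
        # Per-voxel fusion weight computed from both inputs (assumed form).
        logit = fuse_b + sum(w * v for w, v in zip(fuse_w, bev + emb))
        alpha = sigmoid(logit)
        voxels.append([alpha * b + (1.0 - alpha) * e for b, e in zip(bev, emb)])
    return voxels


# A new branch with arbitrary outputs does not change the features at gate 0.
x = [0.5, -1.0, 2.0]
new_branch = lambda feats: [10.0 * v for v in feats]
shielded = gated_residual(x, new_branch, gate=0.0)

# The same BEV cell lifted with and without height awareness.
bev_cell = [1.0, 2.0]
embeddings = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]]
adaptive = height_aware_lift(bev_cell, embeddings, fuse_w=[0.1, 0.1, 0.5, -0.5])
replicated = replicate_lift(bev_cell, num_heights=3)
```

In this sketch `shielded` equals `x`, so the grafted branch begins as a no-op and can only start to contribute once training moves the gate away from zero. `adaptive` holds a different feature at each of the three heights, while `replicated` holds three identical copies.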
Ou et al. (Thu,) studied this question.