Multi-view image analysis is a key enabler for robust perception when single viewpoints provide incomplete or ambiguous observations. This challenge is particularly pronounced in industrial inspection of transparent materials, where view-dependent optical effects, subtle surface degradations, and annotation noise significantly hinder reliable detection and severity assessment. In this work, we introduce a compact and efficient multi-view fusion architecture tailored to such constraints. Our approach combines shared-weight hierarchical encoders with selective state-space modeling to explicitly exploit cross-view and multi-scale correlations. Multi-View Mamba Blocks (MVMB) perform adaptive fusion at each feature level by coupling Mamba-based selective state-space layers with FiLM-driven cross-view conditioning, while a Global State-Space Fusion Block enforces long-range coherence across all views and resolutions. Task-specific decoding heads query the resulting global representation via cross-attention to jointly predict object localization and ordinal wear severity. The model is trained using a unified multi-task objective that integrates geometric regression, ordinal classification, cross-view consistency, feature alignment, and sequential smoothness. Extensive experiments on a challenging multi-view glass container inspection dataset demonstrate improved robustness, consistency, and scalability compared to strong baselines. To promote reproducibility and future research, we publicly release the proposed dataset at: https://datasets.liris.cnrs.fr/mvep-version1.
Bernardi et al. studied this question.
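To make the FiLM-driven cross-view conditioning mentioned in the abstract more concrete, the following is a minimal sketch of feature-wise linear modulation across views. It is an illustration under stated assumptions, not the authors' MVMB implementation, which additionally couples this conditioning with Mamba-based selective state-space layers. Assumptions: each view is reduced to a per-channel descriptor vector; the context for a view is the mean of the other views' descriptors; the scale is parameterized as 1 plus a linear projection; and the names `film_cross_view`, `w_gamma`, and `w_beta` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def mean_pool(views: List[Vector]) -> Vector:
    """Average per-channel descriptors over a set of views."""
    n = len(views)
    return [sum(v[c] for v in views) / n for c in range(len(views[0]))]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Matrix-vector product, standing in for a learned projection (assumption)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def film_cross_view(views: List[Vector], w_gamma: Matrix, w_beta: Matrix) -> List[Vector]:
    """Modulate each view with a scale and shift predicted from the other views.

    FiLM computes y = gamma * x + beta per channel, where gamma and beta
    depend on a conditioning signal. Here the signal is the mean descriptor
    of all other views (an assumption made for this sketch).
    Requires at least two views.
    """
    if len(views) < 2:
        raise ValueError("cross-view conditioning needs at least two views")
    modulated = []
    for i, feat in enumerate(views):
        others = views[:i] + views[i + 1:]
        context = mean_pool(others)
        # Scale is centred at 1 so that zero weights leave features unchanged.
        gamma = [1.0 + g for g in linear(w_gamma, context)]
        beta = linear(w_beta, context)
        modulated.append([g * x + b for g, x, b in zip(gamma, feat, beta)])
    return modulated
```

With all-zero projection weights the function returns the input features unchanged, which is a convenient sanity check before substituting learned parameters.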