Automated interpretation of otoscopic images is challenging because of subtle textural variations, anatomical complexity, and inconsistent acquisition conditions. This study develops an accurate and interpretable deep learning framework for ear disease classification: BioOtoFusionNet, a clinically motivated dual-branch architecture that combines a Frequency-Aware Stream based on Discrete Wavelet Transform (DWT) sub-bands with a Shape-Aware Stream that captures anatomical structures using edge-based features and capsule attention. An Adaptive Cross-Fusion Module (ACFM) and Multi-Scale Attention Pooling (MSAP) fuse the complementary representations across spatial resolutions. Evaluation was conducted on a four-class otoscopic dataset using stratified five-fold cross-validation with strict separation between training and evaluation samples; data augmentation was applied only to the training subsets to prevent information leakage. BioOtoFusionNet achieved an overall accuracy of 96.8%, an F1-score of 95.6%, and an AUC-ROC of 98.4%, outperforming all ablated variants. Class-wise accuracy was high for Earwax Plug (97.8%), Myringosclerosis (94.5%), Chronic Otitis Media (93.2%), and Normal Ear (97.1%). Clinically motivated interpretability metrics indicated balanced reliance on texture and structure (FHI = 0.60, SAR = 1.25), strong attention localization (ALS = 0.72), and stable multi-scale behaviour (MSAC = 0.87). Robustness analysis showed resilience to illumination variations (RII = 0.15) and low diagnostic ambiguity (DCS = 0.31). These results indicate that BioOtoFusionNet provides accurate, interpretable, and robust ear disease classification from otoscopic images, with potential for clinical decision support and telemedicine-based otologic screening.
Jain et al. studied this question.
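The evaluation protocol described in the abstract (stratified five-fold cross-validation, with augmentation restricted to the training subsets) can be illustrated with a minimal sketch. This is not the authors' code: the function names `stratified_k_fold`, `augment`, and `build_fold_sets`, the round-robin fold assignment, the "flipped" placeholder augmentation, and the toy labels are all assumptions introduced here for illustration only.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, eval_idx) pairs with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's samples round-robin so every fold gets a similar share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        eval_idx = sorted(folds[i])
        train_idx = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train_idx, eval_idx


def augment(sample_id):
    """Hypothetical placeholder: return tagged variants of one training sample."""
    return [(sample_id, "original"), (sample_id, "flipped")]


def build_fold_sets(labels, k=5, seed=0):
    """Build per-fold sets; augmentation is applied to training samples only."""
    fold_sets = []
    for train_idx, eval_idx in stratified_k_fold(labels, k, seed):
        train_set = [variant for idx in train_idx for variant in augment(idx)]
        # Evaluation samples are left untouched, so no augmented copy of an
        # evaluation image can leak into training or inflate the scores.
        eval_set = [(idx, "original") for idx in eval_idx]
        fold_sets.append((train_set, eval_set))
    return fold_sets


# Toy labels: four classes with ten samples each (not the study's data).
labels = [c for c in range(4) for _ in range(10)]
fold_sets = build_fold_sets(labels)
```

Splitting before augmenting is the design choice that matters here: if augmented copies were generated first and then split, near-duplicates of one image could fall on both sides of a fold boundary.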