Cultural heritage sites face accelerating degradation due to climate change, yet traditional monitoring relies on unimodal analysis (visual inspection or environmental sensors alone) that fails to capture the complex interplay between environmental stressors and material deterioration. We propose a lightweight multimodal architecture that fuses sensor data (temperature, humidity) with visual imagery to predict degradation severity at heritage sites. Our approach adapts PerceiverIO with two key innovations: (1) simplified encoders (64D latent space) that prevent overfitting on small datasets (37 samples for training, 555 with data augmentation; 13 for validation; and 13 for testing), and (2) an Adaptive Barlow Twins loss that encourages modality complementarity rather than redundancy. On data from Strasbourg Cathedral, our model achieves 76.9% accuracy and a 77.0% weighted-F1 score on the test set, a 43% improvement over standard multimodal architectures (VisualBERT, Transformer) and 25% over vanilla PerceiverIO. Ablation studies reveal that the sensor-only model achieves 61.5% accuracy while the image-only model reaches 46.2%, confirming successful multimodal synergy. A systematic hyperparameter study identifies an optimal moderate correlation target (τ = 0.3) that balances alignment and complementarity, achieving 69.2% accuracy compared to other τ values (τ = 0.1/0.5/0.7: 53.8%; τ = 0.9: 61.5%). This work demonstrates that architectural simplicity combined with contrastive regularization enables effective multimodal learning in data-scarce heritage monitoring contexts, providing a foundation for AI-driven conservation decision support systems.
Roqui et al. (Fri,) studied this question.
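The abstract does not give the form of the Adaptive Barlow Twins loss, so the following is only a minimal sketch under a stated assumption: it takes the standard Barlow Twins objective (cross-correlation matrix between two embeddings, diagonal pushed toward 1, off-diagonal toward 0) and replaces the diagonal target with the correlation target τ. The function name `adaptive_barlow_twins_loss`, the weight `lambda_offdiag`, and the batch shapes are hypothetical and may differ from the authors' implementation.

```python
import numpy as np


def adaptive_barlow_twins_loss(z_sensor, z_image, tau=0.3, lambda_offdiag=5e-3, eps=1e-8):
    """Barlow-Twins-style loss with a diagonal target of tau instead of 1.

    ASSUMPTION: this formulation is an illustrative guess, not the paper's
    confirmed definition. z_sensor and z_image are (batch, dim) embeddings
    of the two modalities.
    """
    z_a = np.asarray(z_sensor, dtype=np.float64)
    z_b = np.asarray(z_image, dtype=np.float64)
    n = z_a.shape[0]

    # Standardize each embedding dimension across the batch.
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    # Cross-correlation matrix between the two modalities, shape (dim, dim).
    c = z_a.T @ z_b / n
    diag = np.diagonal(c)

    # Diagonal term: pull matched dimensions toward a moderate correlation tau,
    # so the modalities align without becoming fully redundant.
    on_diag = np.sum((diag - tau) ** 2)

    # Off-diagonal term: decorrelate mismatched dimensions.
    off_diag = np.sum((c - np.diag(diag)) ** 2)

    return float(on_diag + lambda_offdiag * off_diag)


# Usage with random stand-in embeddings (64D, matching the latent size above).
rng = np.random.default_rng(0)
sensor_emb = rng.normal(size=(16, 64))
image_emb = rng.normal(size=(16, 64))
loss = adaptive_barlow_twins_loss(sensor_emb, image_emb, tau=0.3)
```

Under this assumed form, τ = 1 recovers the usual Barlow Twins alignment target, while smaller values penalize fully redundant embeddings, which is one way to read "complementarity rather than redundancy".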