Background: Freeze-thaw processes leave diagnostic traces in archaeological soils and sediments that are central to reconstructing past climates and understanding hominin adaptations to glacial environments. Identifying these features through thin section micromorphology is well established, but the traces can be subtle and variably expressed, and they can overlap with those of other pedogenic processes. Identification is therefore time-consuming, expert-dependent, and subject to high inter-observer variability.

Methods: We trained five convolutional neural network architectures on photomicrographs from eleven Plio-Pleistocene archaeological sites, implementing a two-stage classification approach (first presence, then feature type). We validated model outputs against both interpretability analysis and a blind survey of practicing micromorphologists.

Results: The results reveal a performance paradox: some models achieve high performance but rely on spurious correlations rather than diagnostic criteria, while models that focus on micromorphologically relevant features show lower overall performance. Expert agreement on the same task is low, with uncertainty concentrated in feature detection rather than classification. Crucially, model and expert errors are largely independent, and each captures different aspects of frost feature recognition, establishing a basis for complementarity.

Conclusions: These findings demonstrate that effective computational integration in micromorphology requires not only accurate classification but also interpretability validation, ensuring that models reason from the same diagnostic criteria as experts. We propose a human-in-the-loop approach in which models provide consistent first screening, while experts offer contextual interpretation and diagnostic validation. Additionally, we present an interactive open-access tool that implements this pipeline to facilitate adoption and repeatability.
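
The two-stage classification approach named in the Methods (first presence, then feature type) can be illustrated as a simple cascade. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `two_stage_classify`, the class names in `FEATURE_TYPES`, and the 0.5 threshold are hypothetical, and the two models are stand-in callables rather than the trained networks.

```python
from typing import Callable, Optional, Sequence, Tuple

# Placeholder class names: the abstract does not list the feature types.
FEATURE_TYPES: Tuple[str, ...] = ("feature_type_a", "feature_type_b", "feature_type_c")


def two_stage_classify(
    image: object,
    presence_model: Callable[[object], float],
    type_model: Callable[[object], Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the source
) -> Tuple[bool, Optional[str]]:
    """Return (feature_present, feature_type or None) for one photomicrograph."""
    # Stage 1: is any frost feature present at all?
    p_present = presence_model(image)
    if p_present < threshold:
        return False, None

    # Stage 2: which feature type? Only reached when stage 1 fires.
    scores = list(type_model(image))
    if len(scores) != len(FEATURE_TYPES):
        raise ValueError("type_model must return one score per feature type")
    best = max(range(len(scores)), key=scores.__getitem__)
    return True, FEATURE_TYPES[best]
```

Separating the two decisions in this way makes it possible to report detection errors and type errors independently, which matches the distinction the Results draw between feature detection and classification.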
Kouki et al. (Sat,) studied this question.