What type of study is this?

This is a quantitative study.

September 28, 2025
Open Access

Leveraging Sparse Autoencoders to Reveal Interpretable Features in Geophysical Models

Key Points

The framework reveals monosemantic behavior in neurons, indicating that they respond to specific physical phenomena.
Regularized sparse autoencoders trained on multilayer perceptron activations reduced the average number of sensitive inputs per neuron by 33%.
Applied to precipitation classification, the framework ties neurons more clearly to temperature and fall speed thresholds.
This approach enhances the interpretability of neural network activations in geoscience applications.

Abstract

Machine learning (ML) is an increasingly popular tool in the geosciences, offering new approaches to numerical weather prediction and complex data set analysis. However, as reliance on these techniques grows, pressing questions about model transparency, internal biases, and trust emerge. Although post hoc explainability analyses can provide insights into how neural network (NN) outputs are generated, a robust framework for interpreting internal decision‐making remains underdeveloped. We address this challenge by exploring a framework to better understand the inner structure of NNs using sparse autoencoders (SAEs). With simplified multilayer perceptrons (MLPs), we demonstrate that hidden layer neurons often exhibit polysemantic behavior, where each feature is mapped to a linear combination of neurons, creating an overcomplete representation. This phenomenon, known as superposition, arises when networks encode more features than available neurons, causing neurons to respond to multiple, seemingly unrelated inputs. By introducing a regularized SAE that learns from the original MLP's activations, we can disentangle these representations, resulting in a 33% reduction in the average number of sensitive inputs per neuron. Applied to a precipitation classification model, this framework reveals evidence of monosemantic behavior, in which neurons respond to a single meaningful concept tied to specific physical phenomena such as temperature and fall speed thresholds for precipitation phase partitioning. We observe similar monosemantic behavior in SAE activations from a snowfall rate regressor, related to particle concentration intensity and vertical radar structures. This framework supports more physically consistent interpretations of hidden neuron activations and greater trust in operational ML models across the geosciences.
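The abstract describes training a regularized SAE on an MLP's hidden activations to undo superposition. This section does not specify the SAE architecture, regularizer, or hyperparameters, so what follows is a minimal sketch under stated assumptions, not the authors' implementation. It uses a common L1-penalized ReLU autoencoder with unit-norm decoder rows, trained by full-batch gradient descent in NumPy, on hypothetical toy activations in which 8 sparse features are squeezed into 4 hidden neurons. All function names, sizes, and the `l1_coeff` and `lr` values are illustrative choices.

```python
import numpy as np


def make_superposed_activations(n_samples, n_features, n_hidden, p_active, rng):
    """Hypothetical toy data: sparse 'true' features packed into fewer hidden neurons."""
    # Each ground-truth feature is active with probability p_active, magnitude in [0, 1).
    mask = rng.random((n_samples, n_features)) < p_active
    feats = mask * rng.random((n_samples, n_features))
    # Random unit-norm feature directions in a smaller hidden space: with more
    # features than neurons, each neuron mixes several features (superposition).
    directions = rng.normal(size=(n_features, n_hidden))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return feats @ directions


class SparseAutoencoder:
    """Assumed SAE form: z = ReLU(a W_enc + b_enc), a_hat = z W_dec + b_dec."""

    def __init__(self, n_hidden, n_latent, l1_coeff, rng):
        self.W_enc = 0.1 * rng.normal(size=(n_hidden, n_latent))
        self.b_enc = np.zeros(n_latent)
        W_dec = rng.normal(size=(n_latent, n_hidden))
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(n_hidden)
        self.l1_coeff = l1_coeff

    def encode(self, acts):
        # Non-negative latent code; ideally one latent per underlying feature.
        return np.maximum(acts @ self.W_enc + self.b_enc, 0.0)

    def reconstruct(self, acts):
        return self.encode(acts) @ self.W_dec + self.b_dec

    def step(self, acts, lr):
        """One full-batch gradient step on (1/n)*sum(resid**2) + (l1_coeff/n)*sum(z)."""
        n = acts.shape[0]
        pre = acts @ self.W_enc + self.b_enc
        z = np.maximum(pre, 0.0)
        resid = z @ self.W_dec + self.b_dec - acts
        # All gradients are computed before any parameter is updated.
        d_out = 2.0 * resid / n
        d_z = d_out @ self.W_dec.T + self.l1_coeff / n  # z >= 0, so d|z|/dz = 1
        d_pre = d_z * (pre > 0)                          # ReLU gate
        self.W_dec -= lr * (z.T @ d_out)
        self.b_dec -= lr * d_out.sum(axis=0)
        self.W_enc -= lr * (acts.T @ d_pre)
        self.b_enc -= lr * d_pre.sum(axis=0)
        # Renormalize decoder rows so the L1 penalty cannot be dodged by
        # shrinking z while inflating W_dec.
        self.W_dec /= np.linalg.norm(self.W_dec, axis=1, keepdims=True)


def main():
    rng = np.random.default_rng(0)
    acts = make_superposed_activations(2000, n_features=8, n_hidden=4, p_active=0.1, rng=rng)
    sae = SparseAutoencoder(n_hidden=4, n_latent=8, l1_coeff=1e-3, rng=rng)
    mse_before = np.mean((sae.reconstruct(acts) - acts) ** 2)
    for _ in range(3000):
        sae.step(acts, lr=0.1)
    mse_after = np.mean((sae.reconstruct(acts) - acts) ** 2)
    # Average number of latents firing per sample (an L0 sparsity measure).
    active = (sae.encode(acts) > 0).sum(axis=1).mean()
    print(f"MSE {mse_before:.4f} -> {mse_after:.4f}; mean active latents {active:.2f}")
    return mse_before, mse_after, active


if __name__ == "__main__":
    main()
```

The overcomplete latent layer (8 latents for 4 neurons) gives the SAE room to assign each toy feature its own latent, while the L1 term keeps only a few latents active per sample. The paper's "sensitive inputs per neuron" metric is not reproduced here, since this section does not define how it is computed.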
