Abstract: Machine learning is an increasingly popular tool in the geosciences, offering new approaches to numerical weather prediction and the analysis of complex data sets. However, as reliance on these techniques grows, pressing questions emerge about model transparency, internal biases, and trust. Although post hoc explainability analyses can provide insight into how neural network (NN) outputs are generated, a robust framework for interpreting internal decision-making remains underdeveloped. We address this challenge by exploring a framework that uses sparse autoencoders (SAEs) to better understand the inner structure of NNs. With simplified multilayer perceptrons (MLPs), we demonstrate that hidden layer neurons often exhibit polysemantic behavior, in which each feature is mapped to a linear combination of neurons, creating an overcomplete representation. This phenomenon, known as superposition, arises when networks encode more features than available neurons, causing neurons to respond to multiple, seemingly unrelated inputs. By introducing a regularized SAE that learns from the original MLP's activations, we can disentangle these representations, resulting in a 33% reduction in the average number of sensitive inputs per neuron. Applied to a precipitation classification model, this framework reveals evidence of monosemantic behavior, in which neurons respond to a single meaningful concept tied to specific physical phenomena such as temperature and fall speed thresholds for precipitation phase partitioning. We observe similar monosemantic behavior in SAE activations from a snowfall rate regressor, related to particle concentration intensity and vertical radar structures. This framework supports the development of more physically consistent interpretations of hidden neuron activations and improved trust in operational ML models across the geosciences.
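As a rough illustration of the idea of a regularized SAE that learns from an MLP's hidden activations, the following is a minimal NumPy sketch. It is not the authors' implementation. The function names (`make_superposed_activations`, `train_sae`, `encode`), the ReLU encoder, the L1 sparsity penalty, the full-batch gradient descent, and all hyperparameters are assumptions chosen for illustration. The synthetic activations stand in for real MLP activations and mimic superposition by projecting more sparse features than there are neurons.

```python
import numpy as np


def make_superposed_activations(n_samples, n_features=8, n_neurons=4, seed=0):
    """Synthetic stand-in for MLP hidden activations (hypothetical helper).

    Sparse features are linearly projected onto fewer neurons, so each
    neuron responds to several features at once.
    """
    rng = np.random.default_rng(seed)
    mask = rng.uniform(size=(n_samples, n_features)) < 0.2   # ~20% of features active
    feats = rng.uniform(0.0, 1.0, size=(n_samples, n_features)) * mask
    proj = rng.normal(size=(n_features, n_neurons))          # features -> neurons
    return feats @ proj


def encode(acts, params):
    """Map activations to non-negative SAE latent codes (ReLU encoder)."""
    return np.maximum(acts @ params["W_enc"] + params["b_enc"], 0.0)


def train_sae(acts, n_latent, l1=1e-3, lr=0.02, epochs=500, seed=0):
    """Fit a one-hidden-layer SAE: reconstruction error plus L1 penalty on codes."""
    rng = np.random.default_rng(seed)
    n, d = acts.shape
    params = {
        "W_enc": rng.normal(0.0, 0.1, size=(d, n_latent)),
        "b_enc": np.zeros(n_latent),
        "W_dec": rng.normal(0.0, 0.1, size=(n_latent, d)),
        "b_dec": np.zeros(d),
    }
    history = []
    for _ in range(epochs):
        # Forward pass
        pre = acts @ params["W_enc"] + params["b_enc"]
        z = np.maximum(pre, 0.0)
        recon = z @ params["W_dec"] + params["b_dec"]
        err = recon - acts
        loss = np.mean(np.sum(err ** 2, axis=1)) + l1 * np.mean(np.sum(np.abs(z), axis=1))
        history.append(float(loss))

        # Backward pass (gradients of the loss above, derived by hand)
        g_recon = 2.0 * err / n
        g_z = g_recon @ params["W_dec"].T + l1 * np.sign(z) / n
        g_pre = g_z * (pre > 0)
        grads = {
            "W_dec": z.T @ g_recon,
            "b_dec": g_recon.sum(axis=0),
            "W_enc": acts.T @ g_pre,
            "b_enc": g_pre.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]
    return params, history


acts = make_superposed_activations(n_samples=256)
params, history = train_sae(acts, n_latent=8)
codes = encode(acts, params)
print("initial loss:", history[0], "final loss:", history[-1])
print("fraction of zero latent codes:", float(np.mean(codes == 0.0)))
```

In practice, `acts` would be replaced by hidden-layer activations recorded from the trained MLP, and the latent codes could then be inspected for how many inputs each latent unit is sensitive to. The sensitivity metric itself is not defined in the abstract, so it is not sketched here.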
King et al. (Fri,) studied this question.