May 14, 2026

Clustering humpback whale calls using deep feature embeddings and latent representations

Key Points

The aim is to explore unsupervised deep learning techniques for clustering humpback whale social calls.
Manually annotated over 6000 calls into 12 categories based on call structure.
Applied a Visual Geometry Group (VGG) convolutional network and Variational Autoencoders (VAEs) for clustering.
Evaluated model performance with respect to noise and variation in call features.
Identified key trade-offs between VGG and VAEs in clustering performance.
Demonstrated that clustering outcomes are sensitive to recording artifacts and background noise.
Compared how well each approach distinguishes nuanced call features.

Abstract

Understanding the structure and variability of animal acoustic repertoires is essential for studying communication and behavior. In this project, we investigated unsupervised deep learning approaches to cluster humpback whale social calls collected from Morro Bay (California), Monterey Bay (California), and Newport (Oregon). Over 6000 calls were manually annotated into 12 broad categories based on call structure observed in spectrograms. Using the DeepAcoustics tool, we applied two distinct deep learning strategies, a Visual Geometry Group (VGG) convolutional network and Variational Autoencoders (VAEs), to cluster the bounding-box annotated calls. These approaches differ in how they interpret and represent input data: VGG applies a hierarchical convolutional structure to extract fixed visual features from images, while VAEs use an encoder–decoder architecture to learn compressed, lower-dimensional representations that capture variation in the input data, allowing similar patterns to be grouped and outliers to be identified. We evaluated how well each method captured structural variation among call types and examined the influence of recording artifacts and background noise on clustering performance. We report trade-offs among model type, sensitivity to noise, and the ability to distinguish nuanced call features. We also discuss the potential benefits of using deep learning architectures developed for raw audio formats to improve clustering.
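
To make the "grouping of similar patterns and identification of outliers" step concrete, the following is a minimal sketch of what can happen after an encoder has mapped each call to a low-dimensional embedding. It is an illustration under stated assumptions, not the DeepAcoustics implementation: the abstract does not name the clustering algorithm, so plain k-means is swapped in here, and the 2-D `embeddings`, the function names, and the `max_dist` outlier threshold are all hypothetical.

```python
import math


def init_centroids(points, k):
    """Farthest-point initialisation (deterministic, so no random seed is needed)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point whose nearest existing centroid is farthest away.
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iters=100):
    """Plain k-means: alternate nearest-centroid assignment and centroid updates."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


def flag_outliers(points, labels, centroids, max_dist):
    """Return indices of points farther than max_dist from their own centroid."""
    return [
        i
        for i, (p, lab) in enumerate(zip(points, labels))
        if math.dist(p, centroids[lab]) > max_dist
    ]


# Hypothetical 2-D embeddings: two tight groups of calls plus one stray point
# (for example, a call dominated by background noise).
embeddings = [
    (0.0, 0.0), (0.1, 0.2), (-0.2, 0.1), (0.2, -0.1),   # group A
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2), (5.2, 5.1),     # group B
    (0.0, 3.0),                                          # stray point
]

labels, centroids = kmeans(embeddings, k=2)
outliers = flag_outliers(embeddings, labels, centroids, max_dist=1.5)
print(labels, outliers)
```

In this toy setup the stray point is still assigned to the nearest group, which is one way noisy recordings can distort cluster centroids; the distance threshold is a simple, assumed way to surface such points for manual review.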
