Understanding the structure and variability of animal acoustic repertoires is essential for studying communication and behavior. In this project, we investigated unsupervised deep learning approaches for clustering humpback whale social calls recorded at Morro Bay (California), Monterey Bay (California), and Newport (Oregon). More than 6,000 calls were manually annotated into 12 broad categories based on the call structure visible in spectrograms. Using the DeepAcoustics tool, we applied two distinct deep learning strategies, Visual Geometry Group (VGG) networks and variational autoencoders (VAEs), to cluster the bounding-box-annotated calls. The two approaches differ in how they interpret and represent input data: VGG applies a hierarchical convolutional structure to extract fixed visual features from images, whereas VAEs use an encoder–decoder architecture to learn compressed, lower-dimensional representations that capture variation in the input data, which allows similar patterns to be grouped and outliers to be identified. We evaluated how well each method captured structural variation among call types and examined how recording artifacts and background noise affected clustering performance. We report trade-offs among model type, sensitivity to noise, and the ability to distinguish subtle call features. We also discuss the potential benefits of using deep learning architectures developed for raw audio to improve clustering.
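To make the VAE side of this comparison concrete, the following is a minimal sketch of the two ingredients that a standard VAE adds to a plain encoder-decoder: sampling a latent vector from the encoder's output, and penalizing the latent distribution for drifting from a standard normal prior. It assumes a diagonal Gaussian latent and a squared-error reconstruction term, which is the textbook formulation and not necessarily the configuration used in DeepAcoustics. All function names and the `beta` weight are hypothetical, chosen only for illustration.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Reconstruction term: summed squared difference between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Reconstruction error plus beta-weighted KL term (beta is an assumed knob)."""
    return squared_error(x, x_hat) + beta * kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -1.0], [0.0, -2.0]   # hypothetical encoder outputs
    z = reparameterize(mu, log_var, rng)     # latent code used for clustering
    print(len(z), kl_to_standard_normal(mu, log_var))
```

In a setup like this, the latent vectors (or the encoder means) are what a downstream clustering step would group, and inputs with unusually high reconstruction error are natural outlier candidates.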
Ferguson et al. (Wed,) studied this question.