Social media data and machine learning methods for automated content analysis are increasingly used in ecology and conservation science. A current limitation is the lack of methods for automated multimodal analysis that combines textual and visual content with other data modalities. In this study, we introduce a multimodal content analysis method and apply it to the investigation of wildlife trade on YouTube. Our approach analyzes text with transformer-based neural networks and video keyframes with convolutional neural networks as part of multimodal filtering, followed by classification in which a decision fusion module identifies instances of wildlife trade. The decision fusion module achieved an F-score of 0.72 among textual classifiers for trade detection and of 0.77 among visual classifiers for species identification. This multimodal classification helped detect wildlife trade in 3,715 out of 86,321 filtered YouTube posts, featuring 226 species for sale, including 51 Critically Endangered, 62 Endangered, 60 Vulnerable, 25 Near Threatened, and 28 Least Concern species. The proposed multimodal learning methods can be used more broadly for other ecological and biodiversity conservation applications.
Momeny et al. (Mon,) studied this question.
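
To make the decision fusion step more concrete, the following is a minimal illustrative sketch in Python of one possible late-fusion rule. It is not the method used in the study: the abstract does not specify the fusion rule, so the AND-style thresholding, the `PostScores` container, the `fuse` function, and the threshold values are all assumptions introduced here for illustration. The sketch takes a trade probability from a text classifier and per-keyframe species probabilities from an image classifier, and flags a post only when both modalities agree.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class PostScores:
    """Hypothetical container for the per-post outputs of the two classifiers."""
    text_trade_prob: float                          # text classifier: P(post offers wildlife for sale)
    keyframe_species_probs: List[Dict[str, float]]  # image classifier: species -> probability, per keyframe


def fuse(post: PostScores,
         text_threshold: float = 0.5,
         image_threshold: float = 0.5) -> Tuple[bool, Optional[str]]:
    """Assumed late-fusion rule: flag a post as trade only if the text score
    passes its threshold AND at least one keyframe shows a species with a
    score passing the image threshold. Returns (is_trade, species)."""
    best_species: Optional[str] = None
    best_prob = 0.0
    # Find the single most confident species prediction across all keyframes.
    for frame in post.keyframe_species_probs:
        for species, prob in frame.items():
            if prob > best_prob:
                best_species, best_prob = species, prob

    is_trade = (post.text_trade_prob >= text_threshold
                and best_prob >= image_threshold)
    return is_trade, (best_species if is_trade else None)


# Usage with made-up scores (not data from the study).
example = PostScores(
    text_trade_prob=0.91,
    keyframe_species_probs=[
        {"Psittacus erithacus": 0.82, "Ara macao": 0.10},
        {"Psittacus erithacus": 0.64},
    ],
)
flagged, species = fuse(example)
```

Requiring agreement between modalities trades recall for precision; a weighted average of the two scores would be an alternative design, again as an assumption rather than a description of the study's module.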