Abstract
Historically, veterinary studies screening for breed, age and sex predispositions to disease have relied on collating small-scale studies of clinical datasets. The availability of larger datasets through groups such as the Small Animal Veterinary Surveillance Network (SAVSNET) promises access to information on a wide range of clinical presentations at scale. However, methodological limitations in extracting specific disease information or screening for disease predispositions substantially reduce the number of animals studied. These studies often address very focused hypotheses, leveraging only a small fraction of the intrinsic value of the data at any one time. Here, we implement an unsupervised machine learning methodology to create a representation of a large volume of clinical notes collected by SAVSNET from veterinary practices across the UK. We use BERTopic, a topic-modelling tool based on the Bidirectional Encoder Representations from Transformers (BERT) architecture, and show that it can surface known phenotypes, such as breed predispositions to hypoadrenocorticism, diabetes mellitus and mitral valve disease, as well as potential novel patterns of disease phenotypes. This scalable and granular modelling technique facilitates the rapid interrogation of large clinical datasets, enabling the identification of a broad range of phenotypes within the population and the early detection of temporal changes indicative of emerging infectious or environmental diseases.
Noble et al. studied this question.
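To make the idea of screening topics for breed predispositions concrete, the following is a minimal sketch, not the authors' method. It assumes each clinical note has already been assigned a topic id (for example by a topic model such as BERTopic) and computes a simple enrichment ratio: the share of a breed within one topic divided by its share across all notes. The function name `breed_enrichment`, the record layout and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def breed_enrichment(records, topic):
    """Return {breed: ratio} for one topic.

    ratio = P(breed | topic) / P(breed overall). Values above 1 suggest
    the breed is over-represented in that topic. This is an illustrative
    statistic, not the analysis used in the paper.
    """
    overall = Counter(breed for breed, _ in records)
    in_topic = Counter(breed for breed, t in records if t == topic)
    n_all = len(records)
    n_topic = sum(in_topic.values())
    if n_topic == 0:
        # No notes were assigned to this topic.
        return {}
    return {
        breed: (in_topic[breed] / n_topic) / (overall[breed] / n_all)
        for breed in in_topic
    }


# Toy, made-up data: one (breed, topic id) pair per consultation note.
records = [
    ("cavalier", 3), ("cavalier", 3), ("cavalier", 1),
    ("labrador", 1), ("labrador", 1), ("labrador", 3),
    ("poodle", 1), ("poodle", 1),
]

ratios = breed_enrichment(records, topic=3)
for breed, ratio in sorted(ratios.items()):
    print(f"{breed}: {ratio:.2f}")
```

In practice a ratio like this would need confidence intervals and correction for multiple comparisons before any topic were read as a predisposition; the sketch only shows the shape of the calculation.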