Biogeographical ancestry (BGA) inference is an important tool in forensic genetics. However, typical approaches often rely on predefined population labels and limited marker sets, which constrain both resolution and flexibility. In this study, we evaluate the potential of unsupervised feature selection for BGA inference using Sparse K-means with Feature Ranking (SKFR), as implemented in OpenADMIXTURE. We leveraged the largest dataset used in a forensic context to date, comprising approximately 6,500 individuals genotyped at ∼600,000 SNPs on the Human Origins (HO) array. Based on this dataset, we evaluated SKFR-selected ancestry-informative marker (AIM) panels ranging from 1,500 to 2,200 SNPs. Clustering performance was assessed using OpenADMIXTURE and quantified with G' similarity. Among the tested panels, a 1,900-SNP panel showed the most consistent clustering results and was selected for further evaluation. To examine forensic relevance, we compared this panel to a randomly selected SNP panel of the same size. Both panels produced broadly similar clustering patterns with OpenADMIXTURE, likely reflecting the marker composition of the HO array. The performance of the 1,900 SKFR-selected SNPs was then evaluated using GENOGEOGRAPHER, a likelihood-based tool for BGA inference. Assignment analyses within a held-out test set provided a detailed overview of concordant and discordant assignments under the chosen reference metapopulations. While differences between the SKFR and random panels were modest, the SKFR panel showed consistently stronger and more stable assignment performance, demonstrating that unsupervised marker selection can add value even under the constraints of SNP arrays enriched for ancestry-informative variants. Overall, our study offers a systematic, critical evaluation of unsupervised AIM selection and its limitations in practical settings.
We show that panel size, array ascertainment, and reference dataset composition jointly shape ancestry-inference performance, and we encourage inference approaches that are not tied to fixed marker panels but instead make use of as many informative SNPs as feasible.