I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 of the world's languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM), trained to discriminate word pairs drawn from linguistically proximate language pairs from random word pairs, together with a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: I find an optimal subset of k = 103 concepts (out of 210 candidates) that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD; ?) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade, owing to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, character matrix, and all per-concept alignments as a replication package.
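The family-level figure above (74 of 113 families recovered as monophyletic) rests on a simple test: a family counts as monophyletic if some clade of the tree contains exactly that family's languages and no others. The following is a minimal sketch of that test, not the author's evaluation code. It assumes a rooted tree written as nested tuples of leaf labels, and the toy tree, the language codes, and the function names `leaf_sets`, `is_monophyletic`, and `monophyly_rate` are all hypothetical illustrations.

```python
def leaf_sets(tree):
    """Return (leaves of `tree`, list of the leaf sets of every clade in it).

    A tree is either a string (a leaf label) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    clades = []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)  # the clade rooted at this node
    return leaves, clades


def is_monophyletic(tree, members):
    """True if some clade's leaf set equals `members` exactly."""
    _, clades = leaf_sets(tree)
    return frozenset(members) in clades


def monophyly_rate(tree, families):
    """Fraction of families (name -> set of leaf labels) that are monophyletic."""
    hits = sum(is_monophyletic(tree, m) for m in families.values())
    return hits / len(families)


# Hypothetical toy data: six languages, three candidate groupings.
toy_tree = ((("deu", "nld"), "eng"), (("fra", "spa"), "cmn"))
toy_families = {
    "Germanic": {"deu", "nld", "eng"},  # forms a clade
    "Romance": {"fra", "spa"},          # forms a clade
    "Mixed": {"eng", "fra"},            # does not form a clade
}

rate = monophyly_rate(toy_tree, toy_families)
```

Applied to the real tree, a rate computed this way would correspond to the 74/113 figure, provided the same rooting and family definitions are used. On an unrooted tree the test would instead have to check bipartitions, which this sketch does not do.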
Gerhard Jäger (Fri,) studied this question.
synapsesocial.com/papers/69d8968f6c1944d70ce08015 — DOI: https://doi.org/10.57754/fdat.ccfpp-z0113
Gerhard Jäger
University of Tübingen