I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 world languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM), trained to discriminate word pairs drawn from linguistically proximate language pairs from random word pairs, together with a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: I find an optimal subset of k = 103 concepts (out of 210 candidates) that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade, owing to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, character matrix, and all per-concept alignments as a replication package.
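The two headline figures above can be illustrated with a small sketch. It is not the paper's evaluation code: the nested-tuple tree, the taxon labels, and the helper names `leaves`, `clades`, and `is_monophyletic` are all hypothetical, and the check assumes a rooted tree in which a family counts as monophyletic only if some subtree contains exactly its members. The exact criterion used in the study is not stated in this abstract.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the leaf set of every subtree, single leaves included."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, members):
    """True if some subtree contains exactly the given taxa (assumed criterion)."""
    members = set(members)
    return any(clade == members for clade in clades(tree))


# Toy rooted tree (hypothetical): family B is broken up by an intruding C language,
# loosely mirroring the MSEA misplacements described in the abstract.
toy_tree = ((("A1", "A2"), "A3"), (("B1", "C1"), "B2"))
toy_families = {"A": {"A1", "A2", "A3"}, "B": {"B1", "B2"}}

recovered = [name for name, taxa in toy_families.items()
             if is_monophyletic(toy_tree, taxa)]

# Arithmetic behind the reported percentages (figures taken from the abstract).
quartet_consistency_pct = round(100 * (1 - 0.036), 1)  # GQD of 0.036
family_recovery_pct = round(100 * 74 / 113, 1)         # 74 of 113 families
```

In the toy tree, family A is recovered and family B is not, because the smallest subtree containing B1 and B2 also contains C1. A real evaluation would read the tree from the released replication package with a phylogenetics library rather than from nested tuples.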
Gerhard Jäger
University of Tübingen
synapsesocial.com/papers/69d8968f6c1944d70ce08015 — DOI: https://doi.org/10.57754/fdat.ccfpp-z0113