What does this research mean for the field?

A multiple sequence alignment pipeline using an optimised subset of 103 lexical concepts reconstructs a global phylogenetic tree of 3,397 languages that reaches 96.4% quartet consistency with the Glottolog expert classification. Novelty: confirmatory. Consensus alignment: supports consensus.

What question did this study set out to answer?

The aim is to create a global phylogenetic tree for 3,397 languages using multiple sequence alignment techniques.

April 10, 2026 · Open Access

A Phylogenetic Tree of 3,397 World Languages via Multiple Sequence Alignment

Key Points

The aim is to create a global phylogenetic tree for 3,397 languages using multiple sequence alignment techniques.
Applied multiple sequence alignment pipeline at a global scale.
Aligned lexical forms for 103 concepts using a pair Hidden Markov Model.
Selected concepts based on phylogenetic difficulty scores.
Analyzed a character matrix with RAxML-NG under a binary substitution model.
Achieved a Generalised Quartet Distance of 0.036 relative to the Glottolog classification.
Recovered 65.5% of tested language families as monophyletic.
Identified areal signal as the main source of error in mainland Southeast Asia.
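The monophyly figure in the list above can be illustrated with a minimal sketch. It assumes a rooted tree written as nested tuples of leaf names and treats a family as monophyletic when the smallest clade containing all of its members contains nothing else. The helper names (`leaf_set`, `is_monophyletic`), the toy tree, and the toy family assignments are hypothetical and are not taken from the paper or its replication package; the paper's own test may differ in detail.

```python
def leaf_set(tree):
    """Collect the leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def is_monophyletic(tree, members):
    """True if the smallest clade containing all members contains only members."""
    members = set(members)
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still contains every member.
        for child in node:
            if members <= leaf_set(child):
                node = child
                break
        else:
            break  # members are split across children: node is their common ancestor
    return leaf_set(node) == members


# Toy tree (hypothetical): family B is interrupted by the stray leaf "x1".
toy_tree = ((("a1", "a2"), "a3"), (("b1", "x1"), "b2"))
toy_families = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]}

recovered = [name for name, langs in toy_families.items()
             if is_monophyletic(toy_tree, langs)]
share = len(recovered) / len(toy_families)

# The reported proportion follows the same arithmetic: 74 of 113 families.
reported_share = round(74 / 113 * 100, 1)
```

Recomputing leaf sets at every step keeps the sketch short; at the scale of 3,397 taxa one would more plausibly cache clade contents or use a phylogenetics library.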

Abstract

I apply the multiple sequence alignment (MSA) pipeline described in Jäger (2025) at global scale to produce a phylogenetic tree of 3,397 world languages. Starting from 185 Lexibank datasets, I align lexical forms for 103 selected concepts using a pair Hidden Markov Model (pHMM) trained to discriminate word pairs from linguistically proximate language pairs against random pairs, and a T-Coffee progressive alignment scheme. Concept selection is guided by PyPythia phylogenetic difficulty scores: I find an optimal subset of k = 103 concepts (out of 210 candidates) that maximises phylogenetic signal. The resulting character matrix (3,397 taxa, 93,504 binary characters) is analysed with RAxML-NG under a binary substitution model. The best maximum-likelihood tree achieves a Generalised Quartet Distance (GQD) of 0.036 against the Glottolog expert classification, corresponding to 96.4% quartet consistency. At the family level, 74 of 113 tested families (65.5%) are recovered as monophyletic. The main source of error is areal signal in mainland Southeast Asia (MSEA): 137 Austroasiatic, Hmong-Mien, and Tai-Kadai languages are placed inside the Sino-Tibetan clade due to shared contact-induced vocabulary and a transcription artefact in the ASJP encoding of tonal languages. I release the ultrametric tree, character matrix, and all per-concept alignments as a replication package.
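To make the GQD figure concrete, here is a brute-force sketch of the measure on toy trees. It assumes the usual reading of GQD: among the four-leaf subsets that the reference tree resolves, the share that the inferred tree resolves differently. Trees are nested tuples, the function names are hypothetical, and a quartet left unresolved by the inferred tree is counted as a mismatch, which is a simplification. Enumerating every quartet is only feasible for small trees and is not how a 3,397-taxon comparison would be run.

```python
from itertools import combinations


def clades(tree):
    """Return (leaves, clade list) for a nested-tuple tree; one clade per internal node."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found += child_clades
    found.append(leaves)
    return leaves, found


def quartet_topology(clade_list, quartet):
    """Return the 2|2 split the tree induces on four leaves, or None if unresolved."""
    q = frozenset(quartet)
    for clade in clade_list:
        inside = clade & q
        if len(inside) == 2:  # this clade separates two of the leaves from the other two
            return frozenset([inside, q - inside])
    return None


def gqd(reference, inferred):
    """Share of reference-resolved quartets that the inferred tree resolves differently."""
    ref_leaves, ref_clades = clades(reference)
    inf_leaves, inf_clades = clades(inferred)
    resolved = different = 0
    for quartet in combinations(sorted(ref_leaves & inf_leaves), 4):
        ref_split = quartet_topology(ref_clades, quartet)
        if ref_split is None:
            continue  # polytomy in the reference: the quartet is not scored
        resolved += 1
        if quartet_topology(inf_clades, quartet) != ref_split:
            different += 1
    return different / resolved if resolved else 0.0


# Toy example (hypothetical trees): the single quartet is resolved differently.
reference = (("a", "b"), ("c", "d"))
inferred = (("a", "c"), ("b", "d"))
distance = gqd(reference, inferred)

# Quartet consistency is the complement of GQD, as in the abstract: 1 - 0.036.
consistency = round(1 - 0.036, 3)
```

Skipping quartets that the reference leaves unresolved is what lets the measure be scored against a classification with polytomies, such as an expert family tree, without penalising the inferred tree for being fully binary.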

Authors

Gerhard Jäger

Institutions

University of Tübingen
