What question did this study set out to answer?

The aim is to design a universal written language that can bridge communication gaps between different dialects and AI systems.

April 6, 2026 · Open Access

After Babel: A Computational Framework for a Universal Written Language Across Human and Artificial Intelligence

Key Points

Analyzed historical evidence of successful unifying writing systems
Developed a computational methodology combining LaBSE and TF-IDF
Established glyph design principles based on Chinese radicals and tokenization
Applied mutual information framework to evaluate dialect-proof stability
Demonstrated that phonological divergence does not lead to script divergence when glyphs carry no phonological information (OCI = 0)
Established a framework preventing the rebus principle from recurring
Identified 400-600 universal semantic root concepts through a cross-linguistic inventory

Abstract

Human written communication has been fragmenting since Babel. The Western alphabetic tradition, which anchors symbols to sounds rather than meanings, ensured that every spoken dialect eventually became a separate written language. Latin became five incompatible Romance scripts within five centuries. The artificial intelligence systems of the 21st century are reproducing this fragmentation at civilisational scale: trained predominantly on English, they process the world in English conceptual structures regardless of the language in which they ultimately communicate, encoding one culture's cognitive architecture into the reasoning layer of systems that billions of people rely upon daily.

This paper proposes a solution whose feasibility is demonstrated by 2,200 years of unbroken evidence. When Qin Shi Huang's chancellor Li Si standardised the Chinese writing system in 221 BCE through the reform known as 书同文, he anchored symbols to meaning rather than sound. The character 山 means MOUNTAIN to speakers of Cantonese, Mandarin, and Hokkien: three mutually unintelligible spoken dialects sharing one written symbol, substantially unchanged across every dynasty and conquest since. Music provides parallel evidence in the sonic domain: by communicating through rhythm, melody, and harmonic contour rather than phoneme sequences, music achieves near-zero phonological coupling and has unified human emotional expression across 315 documented societies for forty thousand years.

Three additional empirical validations ground the argument. The Korean-Japanese contrast shows Japan retaining logographic anchoring (OCI ≈ 0.25) and maintaining written unity across all dialects, while Korea's adoption of the phonographic Hangul alphabet (OCI ≈ 0.72) produced measurable written divergence between North and South within eighty years of political separation. Egyptian hieroglyphics provide the historical warning that a writing system beginning at OCI ≈ 0 can be corrupted by phonographic drift: the rebus principle raised Egyptian OCI above zero until the script became unreadable within a generation of its last inscription and required the Rosetta Stone to decode.

The computational methodology for designing this universal script draws on three generations of NLP text representation. Bag of Words and TF-IDF (first generation: statistical, interpretable, language-specific) identify distinctive concepts within individual language corpora but cannot bridge language boundaries. Word2Vec, GloVe, and FastText (second generation: static neural embeddings) capture semantic relationships within a language but remain language-specific. LaBSE and multilingual BERT (third generation: contextual LLM embeddings) produce a shared semantic space across 109 languages where the vectors for 'water' in English, '水' in Mandarin, and 'maji' in Swahili cluster in the same region. The Universal-TF-IDF methodology proposed here combines all three: LaBSE maps tokens from all languages into a shared semantic space, then inverted TF-IDF identifies concepts with high frequency across all language corpora (universals rather than discriminators), mining a cross-linguistically validated inventory of 400–600 universal semantic root concepts.

The glyph design principles draw on the Chinese radical tradition and its structural isomorphism with modern Byte Pair Encoding tokenization. A three-level architecture maps directly: root glyphs (原符) correspond to atomic units; concept characters (意符) to subword tokens; compound words (合符) to word tokens. Extensibility without fragmentation is governed by three Chinese terms that define three distinct theoretical dimensions: 词元 (cíyuán, how concepts function as NLP computational tokens), 代币 (dàibì, how new concepts are minted and ratified through a blockchain-derived governance protocol), and 象征 (xiàngzhēng, what glyphs are as meaning-anchored symbols pointing to physical reality). The governance architecture explicitly prevents the rebus principle from recurring: any proposed 词元 that encodes phonological information rather than physical-domain meaning fails the Composition Grammar validation and cannot advance to the Stable Core registry.

The formal proof of dialect-proof stability is derived through Shannon's mutual information framework. The Orthophonemic Coupling Index (OCI = I(G; P) / H(P)) measures where any writing system sits on the spectrum from fully phonographic to fully semantic. The Channel Separation Theorem (Theorem 7.1) proves that when I(G; P) = 0, semantic transmission fidelity is invariant across all dialect conditions: phonological divergence has zero path to script divergence. The Fragmentation Rate Formula (EΔG(t) ∝ OCI₀ · σ² · t) reduces to zero for all time when OCI₀ = 0, making the system mathematically stable against fragmentation by proof rather than by institutional enforcement alone. Humanity already possesses two communication channels operating at OCI ≈ 0: music in the sonic domain (40,000+ years) and mathematical notation in the quantitative domain (300 years). This paper proposes the third such channel, the first designed for general written communication across human and artificial intelligence systems.
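To make the ratio OCI = I(G; P) / H(P) concrete, the following is a minimal sketch that estimates it from a list of observed (glyph, pronunciation) pairs. It is an illustration only, not the paper's estimation procedure: reading G as the written symbol and P as its spoken realisation is an assumption made here, the function names `entropy` and `oci` are hypothetical, and the two toy datasets are invented. The sketch uses the standard identity I(G; P) = H(G) + H(P) − H(G, P).

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c > 0)


def oci(pairs):
    """Estimate I(G; P) / H(P) from (glyph, pronunciation) observations.

    Assumption: each pair is one observed pronunciation of one glyph.
    Returns 0.0 when H(P) = 0, where the ratio is otherwise undefined.
    """
    h_g = entropy(Counter(g for g, _ in pairs))
    h_p = entropy(Counter(p for _, p in pairs))
    h_joint = entropy(Counter(pairs))
    if h_p == 0:
        return 0.0
    mutual_information = h_g + h_p - h_joint
    return mutual_information / h_p


# Toy data (invented): each glyph fixes its pronunciation, so the glyph
# fully determines the sound (the fully phonographic end of the spectrum).
phonographic = [("a", "/a/"), ("b", "/b/")] * 10

# Toy data (invented): every glyph is pronounced both ways equally often,
# so the glyph carries no information about the sound.
semantic = [("G1", "x"), ("G1", "y"), ("G2", "x"), ("G2", "y")] * 5

print(oci(phonographic))  # 1.0
print(oci(semantic))      # 0.0
```

Under these assumptions the two toy cases land at the ends of the spectrum described above. Real writing systems would fall in between, and estimating the index from corpus data would need a far larger sample and a principled definition of the pronunciation variable.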
Keywords: universal writing system, after Babel, logographic script, OCI, tokenization, dialect-proof, human-AI communication, 书同文, 词元, 代币, 象征, TF-IDF, LLM embeddings, Chinese radicals, semantic universals, Egyptian hieroglyphics, Korean-Japanese contrast, Channel Separation Theorem

Authors

Wen Gio Lim
