June 26, 2026 | Open Access

Diacritic Restoration in the Serbo-Croatian Macro-Language: diacritic-stripped writing (šišana) as a written register and an on-device task

Key Points

The study analyzes diacritic restoration in the Serbo-Croatian macro-language, that is, recovering diacritics from diacritic-stripped (šišana) text.
Comparison of three independently compiled standard lexicons for diacritic restoration.
Validation through an independent Universal Dependencies benchmark against a corpus-trained tool.
Native-speaker validation study outlined.
For more than 99.5% of the lexicon, the diacritization of a stripped form does not depend on the standard.
A compact (~12.8 MB) offline dictionary restorer reaches near-ceiling accuracy at the lowest false-positive rate on the benchmark.
Minimal foreign-language contamination noted in the shared lexicon.

Abstract

Preprint 2 of a study of diacritic restoration (šišana/dešišavanje) in the Bosnian–Croatian–Serbian (BCS) Latin standards. Comparing three independently compiled standard lexicons, it shows that the diacritization of a stripped form is standard-independent for more than 99.5% of the lexicon: standard differences surface as different stripped forms, not as different diacritizations of the same form. The result is corroborated by the minimal foreign-language contamination of the shared lexicon and by the structural asymmetry of detecting the standard from text, and is supported by an independent Universal Dependencies benchmark against the corpus-trained REDI tool, on which a compact (~12.8 MB) offline dictionary restorer reaches near-ceiling accuracy at the lowest false-positive rate. This is a different point on the accuracy/footprint trade-off rather than a claim of superior accuracy. A native-speaker validation study is outlined. This deposit contains the article in two language versions: English (preprint2deposit.pdf, with full data appendices) and a neutral Serbo-Croatian/BCS version (preprint2deposit_hbs.pdf). Full derived data lists are reproducible from open frequency sources via the included pipeline and are available from the author on request.
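The standard-independence claim can be illustrated with a minimal Python sketch. It is not the author's pipeline. The function names, the `đ` to `d` stripping convention, and the three toy lexicons are assumptions made for illustration, not data from the paper. The sketch strips diacritics, groups each lexicon's words by stripped form, and reports the share of stripped forms attested in at least two lexicons whose restorations agree.

```python
import unicodedata
from collections import defaultdict

# Assumed stripping convention: each diacritic letter maps to its base letter.
# (In practice "đ" is also written "dj"; this sketch uses the simpler đ -> d.)
STRIP = str.maketrans("čćšžđČĆŠŽĐ", "ccszdCCSZD")


def strip_diacritics(word: str) -> str:
    """Return the diacritic-stripped form of a word."""
    # Normalize to NFC so combining-caron input matches the precomposed table.
    return unicodedata.normalize("NFC", word).translate(STRIP)


def restorations(lexicon):
    """Map each stripped form to the set of lexicon words that strip to it."""
    table = defaultdict(set)
    for word in lexicon:
        table[strip_diacritics(word)].add(word)
    return table


def standard_independent_share(lexicons):
    """Share of stripped forms, among those attested in two or more lexicons,
    whose restoration sets are identical in every lexicon attesting them."""
    tables = [restorations(lex) for lex in lexicons]
    shared = agree = 0
    for key in set().union(*tables):
        sets = [t[key] for t in tables if key in t]
        if len(sets) < 2:
            continue  # form occurs in one lexicon only: no conflict possible
        shared += 1
        if all(s == sets[0] for s in sets):
            agree += 1
    return agree / shared if shared else 0.0


# Hypothetical toy lexicons (illustrative only).
toy_hr = {"mlijeko", "šišana", "čaša", "kuća"}
toy_sr = {"mleko", "šišana", "čaša", "kuća"}
toy_bs = {"mlijeko", "šišana", "čaša", "kuća"}

share = standard_independent_share([toy_hr, toy_sr, toy_bs])
```

In the toy data the ijekavian/ekavian pair "mlijeko"/"mleko" yields two different stripped forms, so it never competes for the same restoration. That is the pattern the abstract describes. A real measurement would need the three full lexicons and the author's own stripping rules.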
