What question did this study set out to answer?

The study aims to improve topic modeling in agricultural texts through the Syntopextract framework, focusing on semantic disambiguation.

April 22, 2026 · Open Access

Syntopextract—extracting agricultural topics using polysemy and topic coherence

Key Points

Developed the Syntopextract framework using synset-based semantic disambiguation.
Analyzed a corpus of 320 official agricultural documents from the Government of India.
Applied lemmatization, TF-IDF filtering, and LDA topic analysis to evaluate thematic coherence.
Reduced vocabulary dimensionality by 52% while producing 10 coherent themes.
Achieved a C_V coherence score of 0.84 and a U_Mass score of -10.8, indicating high semantic quality.
Provided more interpretable topic definitions than conventional models like LDA and CTM.
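The U_Mass figure above is easier to read with the metric's shape in mind. The following is a minimal sketch, assuming the common document co-occurrence formulation with add-one smoothing, averaged over word pairs; the summary does not say which implementation or aggregation the authors used. The function name `u_mass` and the toy documents are illustrative, not taken from the paper.

```python
import math
from itertools import combinations


def u_mass(top_words, documents, eps=1.0):
    """Average pairwise U_Mass-style score for one topic's ranked top words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    # Each lower-ranked word is conditioned on a higher-ranked one.
    for i, j in combinations(range(len(top_words)), 2):
        earlier, later = top_words[i], top_words[j]
        denom = doc_freq(earlier)
        if denom == 0:
            continue  # word never occurs; skip the pair
        scores.append(math.log((doc_freq(earlier, later) + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")


# Toy, already-tokenized documents (hypothetical data).
docs = [
    ["soil", "crop", "irrigation"],
    ["soil", "crop", "yield"],
    ["crop", "subsidy", "loan"],
    ["loan", "credit", "subsidy"],
]

coherent = u_mass(["crop", "soil", "irrigation"], docs)
mixed = u_mass(["crop", "credit", "irrigation"], docs)
```

Because the score is built from log ratios of document counts, it is usually negative, and a topic whose top words co-occur more often scores closer to zero (here `coherent` is higher than `mixed`). How a reported value compares across studies depends on the corpus and on whether pair scores are summed or averaged.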

Abstract

Agricultural research and policy documents exhibit high levels of lexical redundancy and polysemy, making it difficult to interpret topic results derived from traditional word-based models. This paper proposes a new synset-based framework, called Syntopextract, that applies semantic disambiguation before topic/theme inference to improve the thematic quality of agricultural corpora. The corpus for this study consisted of 320 official Government of India documents (approx. 6.2 million tokens), processed using lemmatization, Term Frequency–Inverse Document Frequency (TF–IDF) term filtering, WordNet-based sense disambiguation, and term compression into semantic-weighted synsets before LDA topic analysis. The Syntopextract approach reduced vocabulary dimensionality by 52% while producing 10 coherent themes with greater semantic quality than the baseline models, as indicated by a C_V coherence score of 0.84, a U_Mass score of -10.8, and an average β membership score of 0.9. Syntopextract effectively disambiguates polysemous terms (e.g., train → mentor/guide or field → farmland), resulting in more interpretable and stable topic definitions than LDA, LDA–Word2Vec, CTM, and BERTopic for the evaluated agricultural policy corpus, as reflected by coherence and stability metrics. As such, Syntopextract represents the first synset-driven method for topic modeling in agricultural text mining and establishes a strong methodological framework for extracting meaningful themes from complex government documents. Ultimately, the enhanced semantic quality provided by Syntopextract strengthens the utility of topic models for agricultural knowledge management by researchers and policymakers.
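The preprocessing order described in the abstract (TF-IDF filtering, then sense disambiguation, then compression into synsets) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `SENSES` and `SYNONYMS` tables are small hand-written stand-ins for WordNet lookups, the `SYN_*` labels and all function names are hypothetical, the input is assumed to be lemmatized already, and the final LDA step is omitted.

```python
import math
from collections import Counter

# Hypothetical sense inventory: polysemous lemma -> {context cue: synset label}.
SENSES = {
    "train": {"farmer": "SYN_mentor_guide", "railway": "SYN_rail_vehicle"},
    "field": {"crop": "SYN_farmland"},
}
# Hypothetical synonym table: unambiguous lemma -> synset label.
SYNONYMS = {
    "mentor": "SYN_mentor_guide",
    "guide": "SYN_mentor_guide",
    "farmland": "SYN_farmland",
}


def tfidf_keep(docs, min_score=0.0):
    """Return terms whose best TF-IDF score in any document exceeds min_score."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    best = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            score = (count / len(doc)) * math.log(n_docs / df[term])
            best[term] = max(best[term], score)
    return {term for term, score in best.items() if score > min_score}


def to_synset(term, context):
    """Map a term to a synset label, using co-occurring words as sense cues."""
    for cue, label in SENSES.get(term, {}).items():
        if cue in context:
            return label
    return SYNONYMS.get(term, term)


def compress(docs):
    """Drop low-information terms, then replace the rest with synset labels."""
    keep = tfidf_keep(docs)
    return [[to_synset(t, set(doc)) for t in doc if t in keep] for doc in docs]


# Toy lemmatized documents (hypothetical data).
docs = [
    ["government", "officer", "train", "farmer", "field", "crop"],
    ["government", "expert", "guide", "farmer", "farmland", "crop"],
    ["government", "scheme", "mentor", "farmer", "farmland"],
    ["government", "railway", "train", "carry", "grain"],
]

compressed = compress(docs)
vocab_before = {t for doc in docs for t in doc}
vocab_after = {t for doc in compressed for t in doc}
```

In this toy run, "government" is dropped because it appears in every document, "train" resolves to different labels depending on whether "farmer" or "railway" is nearby, and "guide", "mentor", and the agricultural sense of "train" collapse into one label, so `vocab_after` is smaller than `vocab_before`. The compressed documents are what would then be passed to a topic model such as LDA.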

Cite This Study

Bafna et al. studied this question.

synapsesocial.com/papers/69e866c96e0dea528ddeb26d https://doi.org/10.1007/s10791-026-10089-x
