Agricultural research and policy documents exhibit high levels of lexical redundancy and polysemy, which makes topics derived from traditional word-based models difficult to interpret. This paper proposes a synset-based framework, Syntopextract, that applies semantic disambiguation before topic inference to improve the thematic quality of topics extracted from agricultural corpora. The corpus for this study consisted of 320 official Government of India documents (approx. 6.2 million tokens), processed using lemmatization, Term Frequency–Inverse Document Frequency (TF–IDF) term filtering, WordNet-based sense disambiguation, and compression of terms into semantically weighted synsets before Latent Dirichlet Allocation (LDA) topic analysis. Syntopextract reduced vocabulary dimensionality by 52% while producing 10 coherent themes with greater semantic quality than the baseline models, as indicated by a CV coherence score of 0.84, a UMass score of -10.8, and an average β membership score of 0.9. Syntopextract effectively resolves polysemous terms (e.g., train → mentor/guide; field → farmland), resulting in more interpretable and stable topics than LDA, LDA–Word2Vec, CTM, and BERTopic on the evaluated agricultural policy corpus, as reflected by coherence and stability metrics. As such, Syntopextract represents the first synset-driven method for topic modeling in agricultural text mining and establishes a strong methodological framework for extracting meaningful themes from complex government documents. Ultimately, the enhanced semantic quality provided by Syntopextract strengthens the utility of topic models for agricultural knowledge management by researchers and policymakers.
Bafna et al. (Mon,) studied this question.
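To make the disambiguate-then-compress idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes a tiny hand-written sense inventory (`SENSES`, hypothetical) in place of WordNet, and a simplified Lesk-style gloss overlap as the disambiguation rule; the abstract does not state which disambiguation algorithm or synset weighting Syntopextract uses. The function names `disambiguate` and `compress` are likewise illustrative.

```python
# Hypothetical toy sense inventory standing in for WordNet.
# Each term maps to candidate senses: (synset label, gloss words).
SENSES = {
    "train": [
        ("train.vehicle", {"rail", "railway", "locomotive", "station"}),
        ("mentor.guide", {"teach", "farmer", "skill", "extension"}),
    ],
    "mentor": [("mentor.guide", {"teach", "farmer", "skill", "advise"})],
    "field": [
        ("field.discipline", {"study", "research", "science", "subject"}),
        ("farmland", {"crop", "soil", "irrigation", "harvest"}),
    ],
    "farmland": [("farmland", {"crop", "soil", "harvest"})],
}


def disambiguate(term, context):
    """Simplified Lesk (an assumption): pick the sense whose gloss
    shares the most words with the surrounding document tokens."""
    candidates = SENSES.get(term)
    if not candidates:
        return term  # terms without an entry pass through unchanged
    context_words = set(context) - {term}
    best = max(candidates, key=lambda sense: len(sense[1] & context_words))
    return best[0]


def compress(doc):
    """Replace every token in a document by its chosen synset label."""
    return [disambiguate(token, doc) for token in doc]


# Toy lemmatized documents (illustrative, not from the paper's corpus).
docs = [
    ["train", "farmer", "skill", "extension"],
    ["mentor", "farmer", "crop"],
    ["field", "soil", "irrigation", "crop"],
    ["farmland", "harvest", "crop"],
]

compressed = [compress(d) for d in docs]
vocab_before = {t for d in docs for t in d}
vocab_after = {t for d in compressed for t in d}

print(compressed[0])
print(len(vocab_before), "->", len(vocab_after))
```

In this toy setting, "train" and "mentor" collapse to one label and "field" and "farmland" to another, so the vocabulary shrinks before any topic model sees it. The compressed token lists would then be passed to an LDA implementation in place of the raw lemmas.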