April 12, 2026 | Open Access

Mining Two Decades of Soybean Genomics Literature Using Rule-Based Text Mining: Chromosome-Resolved Patterns of Glyma Gene Mentions

Key Points

The study analyzes how Glyma gene mentions in the soybean genomics literature are distributed across chromosomes.
It is a bibliometric analysis of PubMed abstracts published between December 2006 and December 2025.
Glyma gene identifiers were extracted by rule-based text mining with regular expressions.
A total of 377 PubMed records were retrieved and searched for standardized Glyma gene identifiers.
Of these, 340 abstracts (90.2%) contained at least one Glyma gene identifier.
The median number of unique genes per abstract was 1, with a maximum of 14 genes in a single abstract.
Gene mentions were unevenly distributed across chromosomes, with chromosomes 3 and 16 showing the highest frequencies of unique gene mentions.
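
The extraction and counting steps listed above can be illustrated with a minimal Python sketch. The study's actual regular expression is not given here, so the pattern below is an assumption: it accepts identifiers of the form `Glyma.03G000100` and the older form `Glyma03g00300`, and reads the chromosome from the two digits after the prefix. The function names, the toy abstracts, and the choice to compute the median only over abstracts with at least one identifier are likewise illustrative, not taken from the study.

```python
import re
from collections import Counter
from statistics import median

# Assumed pattern (not the study's own): "Glyma", optional dot, two-digit
# chromosome, "G" or "g", then a 5- or 6-digit gene number.
GLYMA_RE = re.compile(r"\bGlyma\.?(\d{2})[Gg](\d{5,6})\b")


def extract_genes(abstract):
    """Return the set of unique (identifier, chromosome) pairs in one abstract."""
    genes = set()
    for match in GLYMA_RE.finditer(abstract):
        chrom = int(match.group(1))
        if 1 <= chrom <= 20:  # soybean has 20 chromosomes
            genes.add((match.group(0).upper(), chrom))
    return genes


def summarize(abstracts):
    """Summarize identifier usage; assumes at least one abstract has a hit."""
    per_abstract = [extract_genes(a) for a in abstracts]
    with_hits = [g for g in per_abstract if g]
    unique_genes = set().union(*per_abstract)
    return {
        "n_records": len(abstracts),
        "n_with_id": len(with_hits),
        "pct_with_id": round(100 * len(with_hits) / len(abstracts), 1),
        "median_genes": median(len(g) for g in with_hits),
        "max_genes": max(len(g) for g in with_hits),
        # Each unique gene is counted once per chromosome across the corpus.
        "chrom_counts": Counter(chrom for _, chrom in unique_genes),
    }


# Invented toy abstracts with made-up identifiers (not data from the study).
TOY_ABSTRACTS = [
    "Glyma.03G000100 and Glyma.16G000200 were upregulated; "
    "Glyma.03G000100 was confirmed by qPCR.",
    "Yield traits were measured without naming standardized identifiers.",
    "The older-style identifier Glyma16g00300 mapped to a QTL.",
]

summary = summarize(TOY_ABSTRACTS)
print(summary)
```

Applied to the counts reported above, the same percentage calculation gives `round(100 * 340 / 377, 1)`, which evaluates to 90.2.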

Abstract

Soybean (Glycine max (L.) Merr.) is a globally important crop with a rapidly expanding body of genomics literature driven by advances in sequencing and functional genomics. Thousands of studies reference soybean genes using standardized Glyma identifiers; however, systematic analyses of how these identifiers are distributed across chromosomes in the scientific literature remain limited. Here, we present a chromosome-resolved bibliometric analysis of soybean gene mentions using a reproducible rule-based text mining approach. PubMed abstracts published between December 2006 and December 2025 were mined for standardized Glyma gene identifiers using regular-expression-based entity extraction. A total of 377 PubMed records were retrieved, of which 340 abstracts (90.2%) contained at least one Glyma gene identifier. The median number of unique genes mentioned per abstract was 1, with a maximum of 14 genes reported in a single study. Our results reveal three major patterns. First, soybean genomics research remains predominantly gene-centric, with most abstracts referencing one or two genes. Second, apparent chromosome-level disparities exist in literature representation within the subset of studies using standardized Glyma identifiers, with chromosomes 3 and 16 exhibiting the highest frequencies of unique gene mentions. A chi-square goodness-of-fit test confirmed that these differences deviate significantly from a uniform distribution (χ² = 123.71, p < 0.001), indicating non-random patterns of gene reporting. Third, a small subset of genes dominates the literature, while the majority of annotated genes are mentioned infrequently, reflecting a long-tailed distribution of research attention. This analysis captures reporting patterns in studies that explicitly use standardized Glyma identifiers and therefore represents a defined subset of the broader soybean genomics literature.
Within this scope, the findings highlight uneven adoption of standardized gene nomenclature and chromosome-level differences in research emphasis. More broadly, this study demonstrates the utility of transparent, rule-based text mining approaches for large-scale bibliometric analyses in plant science and provides a scalable framework for comparative analyses across crop species.
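
As an illustration of the chi-square goodness-of-fit test against a uniform distribution mentioned in the abstract, the sketch below computes the statistic for per-chromosome counts of unique gene mentions. It rests on several assumptions: the expected count is taken to be equal for all 20 chromosomes, the degrees of freedom are taken to be 19 (number of chromosomes minus one, which the abstract does not state), and the counts are invented toy values rather than the study's data. To stay dependency-free, the sketch compares the statistic with the tabulated critical value for df = 19 at α = 0.001 instead of computing an exact p-value.

```python
def chi_square_uniform(observed):
    """Chi-square goodness-of-fit statistic against equal expected counts.

    Returns (statistic, degrees_of_freedom).
    """
    total = sum(observed)
    expected = total / len(observed)
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, len(observed) - 1


# Tabulated critical value of the chi-square distribution,
# df = 19, upper-tail probability 0.001.
CRITICAL_DF19_P001 = 43.820

# Hypothetical counts for chromosomes 1..20 (NOT the study's data):
# chromosomes 3 and 16 are given inflated counts to mimic the reported pattern.
toy_counts = [10] * 20
toy_counts[2] = 30   # chromosome 3
toy_counts[15] = 30  # chromosome 16

statistic, df = chi_square_uniform(toy_counts)
significant = statistic > CRITICAL_DF19_P001
print(f"chi2 = {statistic:.2f}, df = {df}, exceeds 0.001 critical value: {significant}")
```

Under the same df = 19 assumption, the reported statistic of 123.71 lies well above this critical value, which is consistent with the stated p < 0.001.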
