March 3, 2026 · Open Access

Dense and sparse partition by local density in five text corpora: observed proportions (10.0–10.5% and 90.0–90.5%)

Key Points

Dense documents make up 10.0–10.5% of each of the five text corpora, while sparse documents account for 90.0–90.5%.
Each document was represented in a 100-dimensional space and assigned a local density value based on its nearest neighbors; the five corpora total 516,556 documents.
The five corpora differ in size, with document counts ranging from 17,242 to 217,587.
Classification used a fixed local density threshold per corpus; the report documents the resulting proportions and does not examine underlying causes.

Abstract

In five text corpora (Forums, Newsgroups, UFO, eBird Checklist, and eBird Species), totaling 516,556 documents, each document was represented in a 100-dimensional space and assigned a local density value based on nearest neighbors. Using a fixed threshold per corpus, documents were split into two groups: dense (high local density) and sparse (the remainder). The fraction of documents classified as dense lies between 10.0% and 10.5% in every corpus, while the fraction classified as sparse lies between 90.0% and 90.5%. The five corpora differ in size (from 17,242 to 217,587 documents) and domain. The observation is documented in data/observation_table.csv; the table and the proportions figure can be reproduced from the JSON files in data/ using code/reproduce_observation.py. This report is limited to documenting these proportions and does not interpret causes or generality.
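The procedure described above (nearest-neighbor local density followed by a fixed-threshold split) can be illustrated with a minimal sketch. The abstract does not state the density formula, the number of neighbors, or how the threshold was chosen, so everything specific here is an assumption made for illustration: density is taken as the inverse mean Euclidean distance to the k nearest neighbors, k is 5, the threshold is the mean density plus one standard deviation, and the data are random points rather than any of the five corpora. It is not the code in code/reproduce_observation.py.

```python
import math
import random
import statistics


def local_density(points, k):
    """Return one density value per point.

    Assumed definition (not from the paper): the inverse of the mean
    Euclidean distance to the k nearest neighbors.
    """
    densities = []
    for i, p in enumerate(points):
        # Brute-force distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn = sum(dists[:k]) / k
        densities.append(1.0 / (mean_knn + 1e-12))
    return densities


def partition_fractions(densities, threshold):
    """Split by a fixed threshold; return (dense_fraction, sparse_fraction)."""
    n = len(densities)
    dense = sum(1 for d in densities if d >= threshold)
    return dense / n, (n - dense) / n


# Toy stand-in for a corpus: 200 random points in 100 dimensions.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(200)]

densities = local_density(points, k=5)  # k=5 is an assumption

# Hypothetical threshold rule: mean density plus one standard deviation.
threshold = statistics.mean(densities) + statistics.pstdev(densities)

dense_frac, sparse_frac = partition_fractions(densities, threshold)
print(f"dense: {dense_frac:.3f}, sparse: {sparse_frac:.3f}")
```

The brute-force neighbor search is quadratic in the number of documents, which is acceptable for a toy example but would likely need an indexed nearest-neighbor search at the corpus sizes reported here.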

Miguel Pavón studied this question.

synapsesocial.com/papers/69a75bbbc6e9836116a239e2 https://doi.org/10.5281/zenodo.18407378
