In five text corpora (Forums, Newsgroups, UFO, eBird Checklist, and eBird Species), totaling 516,556 documents, each document was represented in a 100-dimensional space and assigned a local density value based on nearest neighbors. Using a fixed threshold per corpus, documents were split into two groups: dense (high local density) and sparse (the remainder). The fraction of documents classified as dense lies between 10.0% and 10.5% in every corpus, while the fraction classified as sparse lies between 90.0% and 90.5%. The five corpora differ in size (from 17,242 to 217,587 documents) and domain. The observation is documented in data/observation_table.csv; the table and the proportions figure can be reproduced from the JSON files in data/ using code/reproduce_observation.py. This report is limited to documenting these proportions and does not interpret causes or generality.
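The dense/sparse split described above can be illustrated with a minimal Python sketch. It is not the code in code/reproduce_observation.py. The density definition (inverse mean distance to the k nearest neighbors), the value of k, the threshold, and the function names `knn_density` and `split_fractions` are all assumptions made for illustration, since the report does not specify them. The toy points are 2-dimensional for readability; the report uses 100 dimensions.

```python
import math


def knn_density(points, k):
    """Assumed density: inverse of the mean distance to the k nearest neighbors."""
    densities = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn = sum(dists[:k]) / k
        densities.append(1.0 / mean_knn if mean_knn > 0 else math.inf)
    return densities


def split_fractions(densities, threshold):
    """Return (dense_fraction, sparse_fraction) for a fixed density threshold."""
    n = len(densities)
    dense = sum(1 for d in densities if d >= threshold)
    return dense / n, (n - dense) / n


# Toy data (hypothetical): three points close together and one far away.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
toy_densities = knn_density(toy_points, k=1)

# Hypothetical threshold; the report fixes one threshold per corpus.
dense_frac, sparse_frac = split_fractions(toy_densities, threshold=0.5)
print(dense_frac, sparse_frac)  # 0.75 0.25
```

The brute-force neighbor search is quadratic in the number of documents, so it is only suitable for small illustrations; corpora of the sizes reported here would need an indexed or approximate nearest-neighbor search.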
Miguel Pavón (Wed,) studied this question.