March 3, 2026 · Open Access

A two-stage registry-anchored approach for precision improvement in organization name recognition from PubMed affiliation strings: a validation study

Key Points

Our model achieved a precision of 0.97 and a recall of 0.93, surpassing an off-the-shelf spaCy named entity recognition baseline (precision, 0.75; recall, 0.47).
The comparative advantage of the model was assessed using nonparametric paired comparison tests and bootstrap methods.
A two-stage model incorporated exact matching with geographic validation followed by selective fuzzy matching for remaining affiliations.
The ablation study indicated that applying exact matching first was necessary to prevent the inflation of false positives observed with unconstrained fuzzy matching, which in turn improves metadata quality.
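
The tiered strategy described in the key points can be illustrated with a minimal sketch. It is not the authors' implementation: the three-entry `REGISTRY`, its placeholder identifiers, the `normalize` rules, the 0.9 similarity threshold, the use of `difflib` for fuzzy scoring, and the choice to keep the country check in the fuzzy stage are all assumptions made for illustration.

```python
import difflib
import re
import unicodedata

# Hypothetical miniature registry: (identifier, canonical name, country code).
# Real registry records carry more fields; these identifiers are placeholders.
REGISTRY = [
    ("org-001", "University of Example", "US"),
    ("org-002", "Example Institute of Technology", "GB"),
    ("org-003", "Sample Medical Center", "US"),
]


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def stage_one_exact(affiliation, country):
    """Stage 1: whole-name containment plus a country check (geographic validation)."""
    padded = f" {normalize(affiliation)} "
    for org_id, name, org_country in REGISTRY:
        if f" {normalize(name)} " in padded and org_country == country:
            return org_id
    return None


def stage_two_fuzzy(affiliation, country, threshold=0.9):
    """Stage 2: fuzzy match of comma-separated segments, only above a strict threshold."""
    segments = [normalize(part) for part in affiliation.split(",")]
    best_id, best_score = None, 0.0
    for org_id, name, org_country in REGISTRY:
        if org_country != country:  # assumed: the country check also applies here
            continue
        target = normalize(name)
        for segment in segments:
            score = difflib.SequenceMatcher(None, segment, target).ratio()
            if score > best_score:
                best_id, best_score = org_id, score
    return best_id if best_score >= threshold else None


def link(affiliation, country):
    """Run fuzzy matching only on affiliations that the exact stage left unmatched."""
    return stage_one_exact(affiliation, country) or stage_two_fuzzy(affiliation, country)


print(link("Dept. of Biology, University of Example, Boston, USA", "US"))
print(link("Universty of Example, Boston", "US"))   # misspelled, resolved in stage 2
print(link("University of Example, London", "GB"))  # name matches, country does not
```

In this sketch the third call returns `None` because the name matches a registry entry whose country disagrees with the affiliation, which is the kind of false positive that geographic validation is meant to block.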

Abstract

Purpose: Reliable bibliometric analysis requires the accurate linkage of heterogeneous affiliation strings to persistent organizational identifiers. Generic natural language processing tools frequently fail at this task because they tend to prioritize coverage rather than precision. This study evaluated whether anchoring an entity-linking model to the Research Organization Registry improved precision relative to generic tools.

Methods: We developed a conservative, two-stage model. First, using a normalized registry corpus, we applied rule-based exact matching with geographic validation. Second, selective fuzzy matching was applied only to the remaining nonmatched affiliations. We evaluated model performance against an off-the-shelf spaCy named entity recognition baseline using a manually adjudicated gold standard dataset derived from PubMed Digital Health records. Finally, we assessed the comparative advantage of our model using nonparametric paired comparison tests and bootstrap methods.

Results: Our two-stage approach achieved substantially higher precision (0.97) and recall (0.93) than both the generic baseline (precision, 0.75; recall, 0.47) and unconstrained fuzzy matching models (precision, 0.77; recall, 0.83). This balanced improvement in precision and recall resulted in the highest F1 score (0.95). The ablation study further confirmed that the “exact matching first” strategy was structurally necessary to prevent the inflation of false positives observed when unconstrained fuzzy matching was applied.

Conclusion: Anchoring entity resolution to a canonical registry using a tiered matching strategy substantially enhances the precision of institutional attribution. This approach provides a robust method for correcting metadata quality in editorial and repository workflows.
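
The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the rounded values quoted in the abstract. The helper name `f1_score` is ours, and the values it derives for the two comparison models are estimates from rounded inputs, not figures reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as quoted in the abstract.
reported = {
    "two-stage model": (0.97, 0.93),
    "spaCy baseline": (0.75, 0.47),
    "unconstrained fuzzy matching": (0.77, 0.83),
}

# F1 for the comparison models is derived here from rounded inputs,
# so it may differ slightly from values computed on the raw counts.
f1_by_model = {name: round(f1_score(p, r), 2) for name, (p, r) in reported.items()}

for name, f1 in f1_by_model.items():
    print(f"{name}: F1 = {f1:.2f}")
```

For the two-stage model this gives 2 × 0.97 × 0.93 / (0.97 + 0.93) ≈ 0.95, which agrees with the F1 score stated in the Results.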
