Purpose: Reliable bibliometric analysis requires the accurate linkage of heterogeneous affiliation strings to persistent organizational identifiers. Generic natural language processing tools frequently fail at this task because they tend to prioritize coverage rather than precision. This study evaluated whether anchoring an entity-linking model to the Research Organization Registry improved precision relative to generic tools.

Methods: We developed a conservative, two-stage model. First, using a normalized registry corpus, we applied rule-based exact matching with geographic validation. Second, selective fuzzy matching was applied only to the remaining nonmatched affiliations. We evaluated model performance against an off-the-shelf spaCy named entity recognition baseline using a manually adjudicated gold standard dataset derived from PubMed Digital Health records. Finally, we assessed the comparative advantage of our model using nonparametric paired comparison tests and bootstrap methods.

Results: Our two-stage approach achieved substantially higher precision (0.97) and recall (0.93) than both the generic baseline (precision, 0.75; recall, 0.47) and unconstrained fuzzy matching models (precision, 0.77; recall, 0.83). This balanced improvement in precision and recall resulted in the highest F1 score (0.95). The ablation study further confirmed that the “exact matching first” strategy was structurally necessary to prevent the inflation of false positives observed when unconstrained fuzzy matching was applied.

Conclusion: Anchoring entity resolution to a canonical registry using a tiered matching strategy substantially enhances the precision of institutional attribution. This approach provides a robust method for correcting metadata quality in editorial and repository workflows.
Kang et al. (Mon,) studied this question.
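The two-stage strategy described in the Methods (exact matching with geographic validation first, fuzzy matching only for what remains) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the registry entries, identifiers, the `normalize` and `link_affiliation` helpers, the comma-based segmentation, and the 0.9 similarity cutoff are all hypothetical, and the standard-library `difflib` stands in for whatever fuzzy matcher the study used.

```python
import difflib
import re
import unicodedata

# Hypothetical miniature registry: (identifier, canonical name, country code).
# A real pipeline would load these records from a registry data dump.
REGISTRY = [
    ("org-001", "University of Example", "US"),
    ("org-002", "Example Institute of Technology", "KR"),
    ("org-003", "Sample National University", "KR"),
]


def normalize(text):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


# Index of normalized name -> list of (identifier, country).
NORMALIZED = {}
for org_id, name, country in REGISTRY:
    NORMALIZED.setdefault(normalize(name), []).append((org_id, country))


def link_affiliation(affiliation, country, fuzzy_cutoff=0.9):
    """Return (identifier, stage), or (None, "unmatched") if nothing qualifies."""
    # Assumption: comma-separated segments approximate the parts of an address.
    segments = [normalize(part) for part in affiliation.split(",")]
    segments = [seg for seg in segments if seg]

    # Stage 1: exact match on the normalized name, accepted only when the
    # registry country agrees with the affiliation's country.
    for seg in segments:
        for org_id, org_country in NORMALIZED.get(seg, []):
            if org_country == country:
                return org_id, "exact"

    # Stage 2: fuzzy matching, reached only when stage 1 found nothing, and
    # restricted to registry names from the same country.
    candidates = [
        name
        for name, entries in NORMALIZED.items()
        if any(org_country == country for _, org_country in entries)
    ]
    for seg in segments:
        close = difflib.get_close_matches(seg, candidates, n=1, cutoff=fuzzy_cutoff)
        if close:
            for org_id, org_country in NORMALIZED[close[0]]:
                if org_country == country:
                    return org_id, "fuzzy"

    return None, "unmatched"
```

In this sketch, a string such as "Dept. of Medicine, University of Example, Springfield" with country "US" resolves at the exact stage, a misspelling such as "Univrsity of Example" falls through to the fuzzy stage, and the same name paired with a mismatched country is left unmatched rather than forced onto a similar-looking record. Returning no link in doubtful cases is the conservative behavior that favors precision over coverage.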