What question did this study set out to answer?

This evaluation aims to assess the performance of large language models in named entity recognition across various datasets and domains.

April 6, 2026

Evaluating Large Language Models on Named Entity Recognition

Key Points

Evaluated twenty-eight LLMs on thirteen datasets across five domains.
Analyzed LLM performance through supervised fine-tuning, parameter scales, hallucinations, and prompt designs.
Developed an LLM-based NER framework with Recognition and Check phases for evaluation.
Supervised fine-tuning enhances LLMs' understanding of human instructions.
LLM capabilities generally improve with increasing parameter scales.
Hallucinations occur in all evaluated LLMs, but the Check phase can help mitigate them.
Prompt designs significantly influence LLM outcomes.

Abstract

Large language models (LLMs) have been gaining prominence due to their exceptional abilities across a wide range of tasks. Although LLMs have been evaluated extensively on natural language understanding tasks such as text classification and sentiment analysis, their evaluation on named entity recognition (NER) remains under-explored. To fill this gap, we evaluate twenty-eight representative LLMs, with parameter counts ranging from 3 billion to 175 billion, on thirteen datasets across five domains, from four perspectives: supervised fine-tuning (SFT), parameter scales, hallucinations, and prompt designs. We propose an LLM-based NER framework (LLM-NER) for the evaluation, which consists of a Recognition phase and a Check phase. Specifically, the Check phase guides LLMs to examine the correctness of recognized entities and is designed to mitigate hallucinations in the NER scenario. Qualitative and quantitative analyses demonstrate that in the NER scenario: 1) SFT empowers LLMs to understand and follow human instructions; 2) LLMs' abilities generally improve as their parameter scales increase; 3) hallucinations exist in all evaluated LLMs, and guiding LLMs to check their outputs is a feasible way to alleviate them; and 4) all evaluated LLMs are sensitive to prompt designs. Based on these analyses, we highlight a number of promising directions for future study. Moreover, our evaluation shows high consistency with two LLM evaluation leaderboards, which evaluate LLMs on other tasks, demonstrating the rationality of our evaluation design.
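
To make the two-phase idea concrete, the following is a minimal sketch of a Recognition phase followed by a Check phase. It is not the study's implementation: the prompt wording, the `text | TYPE` reply format, the function names (`recognize`, `check`, `llm_ner`), and the stand-in model `fake_llm` are all assumptions introduced for illustration. Any callable that maps a prompt string to a reply string could be substituted for `fake_llm`.

```python
from typing import Callable, List, Tuple

Entity = Tuple[str, str]    # (surface text, entity type)
LLM = Callable[[str], str]  # hypothetical interface: prompt in, reply out


def recognize(llm: LLM, sentence: str, types: List[str]) -> List[Entity]:
    """Recognition phase: ask the model to list candidate entities."""
    prompt = (
        f"Extract entities of types {', '.join(types)} from the sentence.\n"
        f"Sentence: {sentence}\n"
        "Answer with one 'text | TYPE' pair per line."
    )
    entities: List[Entity] = []
    for line in llm(prompt).splitlines():
        if "|" not in line:
            continue  # skip lines that do not follow the assumed format
        text, label = (part.strip() for part in line.split("|", 1))
        if text and label in types:
            entities.append((text, label))
    return entities


def check(llm: LLM, sentence: str, entities: List[Entity]) -> List[Entity]:
    """Check phase: ask the model to confirm each candidate, drop the rest."""
    kept: List[Entity] = []
    for text, label in entities:
        prompt = (
            f"Sentence: {sentence}\n"
            f'Is "{text}" a {label} entity in this sentence? Answer yes or no.'
        )
        if llm(prompt).strip().lower().startswith("yes"):
            kept.append((text, label))
    return kept


def llm_ner(llm: LLM, sentence: str, types: List[str]) -> List[Entity]:
    """Run Recognition, then filter its output through Check."""
    return check(llm, sentence, recognize(llm, sentence, types))


def fake_llm(prompt: str) -> str:
    """Stand-in model (assumption): hallucinates 'Atlantis', then rejects it."""
    if prompt.startswith("Extract"):
        return "Paris | LOC\nAtlantis | LOC"
    return "no" if "Atlantis" in prompt else "yes"


result = llm_ner(fake_llm, "She flew to Paris.", ["LOC", "PER"])
print(result)
```

In this toy run the stand-in model returns an entity that never appears in the sentence during Recognition, and the Check phase filters it out. A real model may not reject its own hallucinations this reliably, which is the kind of behavior the evaluation examines.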
