Large language models (LLMs) have gained prominence for their strong performance across a wide range of tasks. Although LLM evaluation has been explored extensively on natural language understanding tasks such as text classification and sentiment analysis, the evaluation of LLMs on named entity recognition (NER) remains under-explored. To fill this gap, we evaluate twenty-eight representative LLMs, with parameter counts ranging from 3 billion to 175 billion, on thirteen datasets across five domains from four perspectives: supervised fine-tuning (SFT), parameter scale, hallucinations, and prompt design. For the evaluation, we propose an LLM-based NER framework (LLM-NER) that consists of a Recognition phase and a Check phase. Specifically, the Check phase guides LLMs to examine the correctness of the recognized entities and is designed to mitigate hallucinations in the NER scenario. Qualitative and quantitative analyses demonstrate that, in the NER scenario: 1) SFT empowers LLMs to understand and follow human instructions; 2) LLMs' ability generally improves as their parameter scale increases; 3) hallucinations exist in all evaluated LLMs, and guiding LLMs to check their outputs is a feasible way to alleviate them; and 4) all evaluated LLMs are sensitive to prompt design. Based on these analyses, we highlight a number of promising directions for future study. Moreover, our evaluation shows high consistency with two LLM evaluation leaderboards that evaluate LLMs on other tasks, which supports the rationality of our evaluation design.
Ji et al. (Thu,) studied this question.