July 3, 2024 | Open Access

Large Language Models as Evaluators for Scientific Synthesis

Abstract

Our study explores how well state-of-the-art Large Language Models (LLMs), such as GPT-4 and Mistral, can assess the quality of scientific summaries or, more fittingly, scientific syntheses, by comparing their evaluations to those of human annotators. We used a dataset of 100 research questions and their syntheses, each generated by GPT-4 from the abstracts of five related papers and checked against human quality ratings. The study evaluates the ability of both the closed-source GPT-4 and the open-source Mistral model to rate these syntheses and to give reasons for their judgments. Preliminary results show that LLMs can offer logical explanations that somewhat match the quality ratings, yet a deeper statistical analysis shows only a weak correlation between LLM and human ratings, suggesting both the potential and the current limitations of LLMs in scientific synthesis evaluation.
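As a rough illustration of the kind of comparison the abstract describes, the sketch below computes a rank correlation between two lists of quality ratings. The abstract does not say which statistic the authors used, so Spearman's rho is an assumption here, and the ratings, variable names, and helper functions are all hypothetical and not taken from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical ratings invented for illustration; not data from the study.
    human_ratings = [5, 3, 4, 2, 4, 1]
    llm_ratings = [4, 4, 5, 3, 3, 2]
    print(f"Spearman rho = {spearman(human_ratings, llm_ratings):.2f}")
```

A rank-based measure is a natural fit for ordinal quality scores because it depends only on how the two raters order the syntheses, not on how each uses the numeric scale.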

Cite this study

Evans et al. (2024) studied this question.

synapsesocial.com/papers/68e61806b6db6435875aa864 — DOI: https://doi.org/10.48550/arxiv.2407.02977

Authors

Julia Evans

Technische Informationsbibliothek (TIB)

Jennifer D’Souza

Technische Informationsbibliothek (TIB)

Sören Auer

Leibniz University Hannover
