What question did this study set out to answer?

This research evaluates how closely ChatGPT's answers align semantically with community-generated answers in open-domain question answering.

January 18, 2026 · Open Access

A topic-aware evaluation of ChatGPT’s semantic alignment with community answers using BERTScore and BERTopic

Key Points

Evaluated ChatGPT's responses using data from Yahoo! Answers.
Used BERTopic to identify themes from 500 varied questions.
Applied BERTScore metrics (precision, recall, F1) to measure similarity between responses.
ChatGPT (GPT-3.5-turbo) achieved an average F1-score of 0.824, indicating strong alignment with community answers.
Performance varies with question type: ChatGPT does particularly well on factual, encyclopedia-style questions, while results are more variable on subjective or open-ended topics.
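As a rough illustration of the BERTScore metrics mentioned in the key points, the sketch below computes precision, recall, and F1 by greedy cosine matching between two sets of token vectors. It is a simplified sketch, not the study's pipeline: the function names are hypothetical, the toy 2-D vectors stand in for contextual BERT embeddings, and optional features of BERTScore such as IDF weighting and baseline rescaling are left out.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_sketch(candidate_vecs, reference_vecs):
    """Greedy-matching precision, recall, and F1 over token vectors.

    candidate_vecs: vectors for the generated answer's tokens
    reference_vecs: vectors for the community answer's tokens
    (Hypothetical helper; real BERTScore uses contextual embeddings.)
    """
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    # F1: harmonic mean of precision and recall (0.0 if both are zero).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy vectors, assumed purely for illustration.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
p, r, f1 = bertscore_sketch(candidate, reference)
print(p, r, f1)
```

In this toy case only one of the two candidate tokens has a close match in the reference, so precision is lower than recall; an answer that adds material absent from the community answer would show the same pattern.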

Abstract

Large language models are increasingly deployed in everyday question-answering tasks, yet their performance across diverse, real-world queries remains underexplored. This study evaluates the semantic alignment of ChatGPT’s responses with human-selected best answers in an open-domain question answering (QA) setting, using data from the Yahoo! Answers platform. Distinct from prior research that centers on domain-specific datasets, this work explores ChatGPT’s general-purpose QA performance across a wide range of topics. BERTopic is used to extract latent themes from 500 diverse full-question samples, and BERTScore metrics (precision, recall, F1) are applied to quantify semantic similarity between ChatGPT-generated responses and top-rated community answers. Results indicate that ChatGPT (GPT-3.5-turbo) achieves a strong average F1-score of 0.824, reflecting high alignment with human judgments. Topic-level analysis reveals that ChatGPT performs particularly well on factual and encyclopedia-style questions, while performance varies across more subjective or open-ended topics. This study introduces a topic-sensitive evaluation framework that enhances understanding of large language model behavior in real-world QA scenarios and supports the development of more effective and explainable conversational artificial intelligence (AI) systems.
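The topic-level analysis described in the abstract amounts to grouping per-question scores by topic and comparing the group averages. A minimal sketch of that aggregation step follows, assuming each question already has a topic label (for example from a topic model) and an F1 score. The function name, topic labels, and score values are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def mean_f1_by_topic(records):
    """Average F1 per topic from (topic_label, f1) pairs.

    Hypothetical helper: the input format is an assumption, not the paper's.
    """
    grouped = defaultdict(list)
    for topic, f1 in records:
        grouped[topic].append(f1)
    return {topic: mean(scores) for topic, scores in grouped.items()}


# Invented example records; these are NOT results from the study.
sample = [
    ("geography", 0.90),
    ("geography", 0.80),
    ("relationships", 0.70),
]
per_topic = mean_f1_by_topic(sample)
for topic, score in sorted(per_topic.items()):
    print(f"{topic}: {score:.3f}")
```

Reporting one average per topic, rather than a single corpus-wide mean, is what lets differences between factual and subjective question types show up.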

Cite This Study

Mashael M. Alsulami studied this question.

synapsesocial.com/papers/696c77afeb60fb80d1395dfa https://doi.org/10.7717/peerj-cs.3446
