Large language models are increasingly deployed in everyday question-answering tasks, yet their performance across diverse, real-world queries remains underexplored. This study evaluates the semantic alignment of ChatGPT’s responses with human-selected best answers in an open-domain question answering (QA) setting, using data from the Yahoo! Answers platform. Distinct from prior research that centers on domain-specific datasets, this work explores ChatGPT’s general-purpose QA performance across a wide range of topics. BERTopic is used to extract latent themes from 500 diverse full-question samples, and BERTScore metrics (precision, recall, F1) are applied to quantify semantic similarity between ChatGPT-generated responses and top-rated community answers. Results indicate that ChatGPT (GPT-3.5-turbo) achieves a strong average F1-score of 0.824, reflecting high alignment with human judgments. Topic-level analysis reveals that ChatGPT performs particularly well on factual and encyclopedia-style questions, while performance varies across more subjective or open-ended topics. This study introduces a topic-sensitive evaluation framework that enhances understanding of large language model behavior in real-world QA scenarios and supports the development of more effective and explainable conversational artificial intelligence (AI) systems.
Mashael M. Alsulami studied this question.
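The BERTScore precision, recall, and F1 values mentioned above come from greedy matching between the token embeddings of a candidate text and those of a reference text. The following is a minimal sketch of that matching step only, not the study's actual pipeline. It assumes the token embeddings have already been computed and are nonzero vectors, and it omits the optional idf weighting and baseline rescaling. The function names and the toy two-dimensional embeddings are hypothetical, chosen purely for illustration.

```python
import math


def _normalize(vec):
    """Scale a nonzero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cosine_matrix(ref_emb, cand_emb):
    """Cosine similarity of every reference token against every candidate token."""
    ref_n = [_normalize(v) for v in ref_emb]
    cand_n = [_normalize(v) for v in cand_emb]
    return [[sum(a * b for a, b in zip(r, c)) for c in cand_n] for r in ref_n]


def greedy_match_scores(ref_emb, cand_emb):
    """Return (precision, recall, f1) using greedy token matching.

    ref_emb:  token embeddings of the reference (e.g. the human-selected answer)
    cand_emb: token embeddings of the candidate (e.g. the model response)
    """
    sim = _cosine_matrix(ref_emb, cand_emb)
    n_ref, n_cand = len(sim), len(sim[0])

    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(row) for row in sim) / n_ref

    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(sim[i][j] for i in range(n_ref)) for j in range(n_cand)
    ) / n_cand

    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example: the reference has two orthogonal "tokens", the candidate covers only one.
precision, recall, f1 = greedy_match_scores(
    ref_emb=[[1.0, 0.0], [0.0, 1.0]],
    cand_emb=[[1.0, 0.0]],
)
print(precision, recall, f1)
```

In this toy case the single candidate token matches one reference token exactly, so precision is 1.0, while only half of the reference is covered, so recall is 0.5. In practice the embeddings would come from a contextual model such as BERT, and an existing BERTScore implementation would normally be used instead of hand-written matching.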