What type of study is this?

This is a descriptive comparative study.

September 10, 2025 | Open Access

Accuracy and Safety of ChatGPT-3.5 in Assessing Over-the-Counter Medication Use During Pregnancy: A Descriptive Comparative Study

Key Points

Responses generated by ChatGPT-3.5 scored high on correctness, with a median score of 5 out of 5.
In terms of completeness, ChatGPT-3.5 achieved a median score of 4, indicating good but not perfect information delivery.
Despite high accuracy, safety errors were present in 9% of evaluations, reflecting risk in using AI as a sole resource during pregnancy.
The independent ratings showed consensus on ChatGPT-3.5's limitations, emphasizing the need for expert consultation when assessing OTC medications.

Abstract

As artificial intelligence (AI) is increasingly used to perform tasks that require human intelligence, patients who are pregnant may turn to AI for advice on over-the-counter (OTC) medications. However, medications used in pregnancy may pose profound safety concerns, and the available safety data are limited. This study evaluates a chatbot's ability to provide accurate information about OTC medication use in patients who are pregnant. A prospective, descriptive design was used to compare the responses generated by the Chat Generative Pre-Trained Transformer 3.5 (ChatGPT-3.5) to the information provided by UpToDate®. Eighty-seven of the top pharmacist-recommended OTC drugs in the United States (U.S.), as identified by Pharmacy Times, were assessed for safe use in pregnancy using ChatGPT-3.5. A piloted, standard prompt was input into ChatGPT-3.5, and the responses were recorded. Two groups independently rated the responses against UpToDate for correctness, completeness, and safety using a 5-point Likert scale. After the independent evaluations, the groups discussed the findings to reach a consensus, with a third independent investigator assigning the final ratings. For correctness, the median score was 5 (interquartile range [IQR]: 5-5). For completeness, the median score was 4 (IQR: 4-5). For safety, the median score was 5 (IQR: 5-5). Despite high overall scores, the safety errors found in 9% of the evaluations (n = 8), including omissions that pose a risk of serious complications, currently render the chatbot an unsafe standalone resource for this purpose.
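The summary statistics reported in the abstract (median, IQR, and an error percentage) can be reproduced with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, the example ratings are made up and are not the study's data, and the denominator of 87 assumes one evaluation per drug. The abstract does not say which quartile method the authors used, so the "inclusive" method here is also an assumption.

```python
from statistics import median, quantiles


def summarize_likert(scores):
    """Return (median, Q1, Q3) for a list of 1-5 Likert ratings.

    Uses the 'inclusive' quartile method; the abstract does not state
    which method the authors used, so this choice is an assumption.
    """
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return median(scores), q1, q3


def error_rate_percent(n_errors, n_total):
    """Return the share of evaluations with an error, as a whole percent."""
    return round(100 * n_errors / n_total)


# Counts from the abstract: 8 safety errors; 87 drugs assessed
# (assuming one evaluation per drug).
safety_error_pct = error_rate_percent(8, 87)

# Made-up ratings for illustration only, NOT the study's data.
example_scores = [5, 5, 4, 5, 5, 3, 5, 4, 5, 5]
example_summary = summarize_likert(example_scores)

print(f"Safety error rate: {safety_error_pct}%")
print("Example (median, Q1, Q3):", example_summary)
```

Under that assumption, 8 of 87 evaluations rounds to the 9% reported in the abstract. The example also shows why a median of 5 with an IQR of 5-5 can coexist with a nontrivial error rate: a handful of low ratings does not move the median or the quartiles.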
