February 5, 2026 · Open Access

Exposing the fragility of LLM reasoning through bias-inducing prompts: evidence from BiasMedQA

Key Points

The study evaluates whether reasoning in large language models (LLMs) changes their susceptibility to cognitive bias-inducing prompts.
Llama-3.3-70B, Qwen3-32B, and Gemini-2.5-Flash, along with their reasoning-enhanced variants, were evaluated on the BiasMedQA dataset.
Each model was tested with a base prompt, a debiasing prompt, and a few-shot prompt.
Mixed-effects logistic regression models were fitted to assess the impact of biases and mitigation strategies on performance.
Reasoning-enhanced variants achieved higher rates of correct responses in all three models.
Gemini-2.5-Flash's performance dropped considerably under four additional unpublished bias-inducing prompts, hinting at potential training-data contamination.
Debiasing and few-shot prompts produced statistically significant reductions in biased responses across all three models.

Abstract

Objectives To evaluate the impact of large language model (LLM) reasoning on model susceptibility to cognitive bias-inducing prompts.

Methods and analysis The performance of Llama-3.3-70B, Qwen3-32B and Gemini-2.5-Flash, along with their reasoning-enhanced variants, was evaluated on the public BiasMedQA dataset, which was developed to evaluate seven established cognitive biases in 1273 clinical case vignettes. Each model was tested using a base prompt, a debiasing prompt instructing the model to actively mitigate cognitive bias, and a few-shot prompt with additional sample cases of biased responses. Beyond the seven biases from BiasMedQA, Gemini-2.5-Flash was additionally tested using four unpublished bias-inducing prompts to look for signs of potential data contamination and to probe its brittleness. For each model pair, two mixed-effects logistic regression models were fitted to determine the impact of biases and mitigation strategies on performance.

Results In all three models, the reasoning-enhanced variant achieved higher rates of correct responses (Llama-3.3-70B: 72.5–82.1% vs 61.0–73.4%, Qwen3-32B: 71.7–78.7% vs 55.5–64.1%, Gemini-2.5-Flash: 81.8–88.6% vs 80.0–83.7%). The performance of Gemini-2.5-Flash dropped considerably when it was exposed to the four additional unpublished bias-inducing prompts (from 80.0–88.6% to 47.4–86.1%), hinting at potential contamination of its training data and exposing underlying brittleness. In Llama-3.3-70B and Gemini-2.5-Flash, reasoning amplified model vulnerability to several bias-inducing prompts, while in Qwen3-32B reasoning reduced susceptibility to one of the seven biases. The debiasing and few-shot prompting approaches yielded statistically significant reductions in biased responses across all three model architectures.

Conclusion In none of the three LLMs did reasoning consistently reduce vulnerability to bias-inducing prompts, revealing the fragility of the reasoning capabilities purported by the model developers.
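
For readers unfamiliar with the analysis, a mixed-effects logistic regression of the kind mentioned in the abstract could take roughly the following form. This is an illustrative sketch only: the abstract does not state the model specification, so the random intercept per vignette, the choice of fixed effects, the interaction term, and all symbols below are assumptions rather than the authors' actual model.

```latex
% Hypothetical specification (not taken from the paper):
% y_{ij} = 1 if the response to vignette i under condition j is correct, else 0
% r_j    = 1 for the reasoning-enhanced variant, 0 for the standard variant
% b_{jk} = 1 if bias-inducing prompt k is applied in condition j, else 0
% u_i    = random intercept for vignette i
\[
\operatorname{logit} \Pr(y_{ij} = 1)
  = \beta_0
  + \beta_1 r_j
  + \sum_{k} \beta_{2k}\, b_{jk}
  + \sum_{k} \beta_{3k}\, r_j\, b_{jk}
  + u_i,
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2)
\]
```

Under this assumed form, a negative interaction coefficient \(\beta_{3k}\) would correspond to reasoning amplifying vulnerability to bias \(k\), and a positive one to reasoning reducing it. A second model of the same shape could replace the bias indicators with indicators for the debiasing and few-shot prompts.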
