What question did this study set out to answer?

The study evaluates whether large language model outputs vary systematically across demographic groups, using financial fraud news stories as test data.

May 24, 2026 | Open Access

Bias Evaluation in Large Language Model Summaries Using Financial Crimes Data

Key Points

Presents a statistical analysis framework for evaluating bias in LLM outputs.
Uses financial fraud stories from the CNN/DailyMail dataset, with substitution-based identity variations across protected demographic classes.
Applies MANOVA and ANOVA to metrics including ROUGE, BERTScore, Adverse Impact Ratio (AIR), and Standardized Mean Difference (SMD).
Finds small but statistically significant differences in metric values across demographic identities.
Sentiment analysis and human evaluation confirm disparities in tone and framing.
Measured disparities appear to decrease across subsequent model generations.
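
The identity-variation step in the key points above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `[IDENTITY]` placeholder, the `make_variants` helper, the example sentence, and the three descriptors (taken from the identities the abstract mentions) are all assumptions made for the example. The idea is that every variant of a story is identical except for the identity descriptor, so any difference in the resulting summaries can be attributed to that descriptor.

```python
# Hypothetical descriptors for illustration; the study's full list is not given here.
IDENTITIES = ("White female", "Black male", "Asian male")

PLACEHOLDER = "[IDENTITY]"  # hypothetical marker, not taken from the paper


def make_variants(template: str, identities=IDENTITIES) -> dict[str, str]:
    """Return one copy of the template per identity, differing only in the descriptor."""
    if PLACEHOLDER not in template:
        raise ValueError("template contains no identity placeholder")
    return {identity: template.replace(PLACEHOLDER, identity) for identity in identities}


# Invented example text, standing in for a financial fraud story.
template = (
    "Prosecutors charged the defendant ([IDENTITY]) with wire fraud. "
    "The defendant denied the allegations."
)
variants = make_variants(template)
for identity, text in variants.items():
    print(f"{identity}: {text}")
```

Each variant would then be summarized by the model under test, and the per-identity summaries scored with the same metrics.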

Abstract

Large language models (LLMs) are being adopted rapidly in financial institutions for applications including customer communication, compliance review, fraud detection, and agentic workflows, but without bias evaluation, they risk reinforcing systemic biases that may lead to unethical or unlawful decisions. To address potential systemic bias in LLMs in regulated settings like financial services, we present a statistical analysis framework and structured, reproducible methodology for evaluating whether LLM outputs vary significantly across demographic groups. Using financial fraud stories from the CNN/DailyMail dataset, we employ substitution-based identity variations across protected demographic classes, generate summaries via three proprietary language models, and perform statistical analysis on common metrics (ROUGE, BERTScore, Adverse Impact Ratio (AIR), and Standardized Mean Difference (SMD)). Statistical approaches such as MANOVA and ANOVA reveal small but significant differences in output metric values (e.g., for White female, Black male, and Asian male identities in our analysis), while sentiment analysis and human evaluation confirm disparities in tone and framing. Our results also indicate that measured disparities appear to decrease across subsequent model generations.
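
As a rough illustration of the group-comparison statistics named in the abstract, the sketch below computes a Standardized Mean Difference, an Adverse Impact Ratio, and a one-way ANOVA F statistic on per-summary scores using only the Python standard library. The exact definitions used in the paper are not given here, so several choices are assumptions: SMD is computed with a pooled standard deviation (the Cohen's d form), AIR treats a score at or above a hypothetical `threshold` as a favorable outcome, and the function names are invented for the example. MANOVA and the p-value lookup are omitted.

```python
from math import sqrt
from statistics import mean, variance


def smd(group, reference):
    """Standardized mean difference using a pooled standard deviation (assumed form)."""
    n_g, n_r = len(group), len(reference)
    pooled_var = ((n_g - 1) * variance(group) + (n_r - 1) * variance(reference)) / (n_g + n_r - 2)
    return (mean(group) - mean(reference)) / sqrt(pooled_var)


def air(group, reference, threshold):
    """Adverse impact ratio: favorable-outcome rate of group over that of reference.

    Assumption: a score >= threshold counts as a favorable outcome.
    """
    def rate(scores):
        return sum(s >= threshold for s in scores) / len(scores)

    reference_rate = rate(reference)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return rate(group) / reference_rate


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_scores = [s for g in groups for s in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

With made-up scores such as `reference = [0.40, 0.42, 0.44]` and `variant = [0.38, 0.40, 0.42]`, `smd(variant, reference)` measures how many pooled standard deviations the variant group sits below the reference, while `anova_f([reference, variant])` tests whether the group means differ by more than within-group noise would suggest.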
