Large language models (LLMs) are being adopted rapidly in financial institutions for applications including customer communication, compliance review, fraud detection, and agentic workflows. Without bias evaluation, however, they risk reinforcing systemic biases that may lead to unethical or unlawful decisions. To address potential systemic bias in LLMs in regulated settings such as financial services, we present a statistical analysis framework and a structured, reproducible methodology for evaluating whether LLM outputs vary significantly across demographic groups. Using financial fraud stories from the CNN/DailyMail dataset, we apply substitution-based identity variations across protected demographic classes, generate summaries with three proprietary language models, and perform statistical analysis on common metrics: ROUGE, BERTScore, Adverse Impact Ratio (AIR), and Standardized Mean Difference (SMD). Statistical tests such as MANOVA and ANOVA reveal small but statistically significant differences in output metric values (e.g., for White female, Black male, and Asian male identities in our analysis), while sentiment analysis and human evaluation confirm disparities in tone and framing. Our results also indicate that measured disparities appear to decrease across subsequent model generations.
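As a rough illustration of two of the metrics named above, the sketch below computes a Standardized Mean Difference and an Adverse Impact Ratio for per-summary scores from one identity group against a reference group. It is not the authors' implementation. The function names, the pooled-standard-deviation form of SMD, the threshold used to decide what counts as a "favorable" score for AIR, and the sample scores are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(group, reference):
    """Difference in means divided by the pooled sample standard deviation.

    Assumption: the pooled-SD (Cohen's d style) form; the source does not
    state which SMD variant it uses.
    """
    n_g, n_r = len(group), len(reference)
    pooled_var = (
        (n_g - 1) * variance(group) + (n_r - 1) * variance(reference)
    ) / (n_g + n_r - 2)
    return (mean(group) - mean(reference)) / sqrt(pooled_var)


def favorable_rate(scores, threshold):
    """Share of scores at or above the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


def adverse_impact_ratio(group, reference, threshold):
    """Favorable-outcome rate of the group divided by that of the reference.

    Assumption: a score counts as 'favorable' when it is >= threshold.
    Raises ZeroDivisionError if no reference score reaches the threshold.
    """
    return favorable_rate(group, threshold) / favorable_rate(reference, threshold)


# Hypothetical per-summary scores (e.g. a ROUGE-style value per summary);
# these numbers are made up and are not results from the study.
reference_scores = [0.50, 0.60, 0.70, 0.80]
group_scores = [0.40, 0.50, 0.60, 0.70]

smd = standardized_mean_difference(group_scores, reference_scores)
air = adverse_impact_ratio(group_scores, reference_scores, threshold=0.6)
print(f"SMD = {smd:.3f}, AIR = {air:.3f}")
```

In this reading, an SMD near zero and an AIR near one would suggest little measured disparity between the two groups on the chosen metric. Which cutoffs count as meaningful is a separate decision that the sketch does not make.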
Tasneem et al. studied this question.