What question did this study set out to answer?

This study investigates the behavioral tendencies of different large language models in investment decision-making, challenging assumptions of neutrality.

April 29, 2026 · Open Access

Algorithmic personalities and the myth of neutrality: financial behavior of large language models in investment decision-making

Key Points

Three leading LLMs were evaluated across 20 startup pitch decks.
Each model-deck pair was tested five times under identical conditions.
Systematic differences in evaluations and funding recommendations were analyzed.
Significant differences in funding recommendations and evaluation scores were found across models.
Reliability varied, with ICC values ranging from 0.240 to 0.930.
Distinct behavioral profiles were identified for each model, influencing investment outcomes.

Abstract

The rapidly growing use of large language models (LLMs) in high-stakes settings, such as venture capital screening, often relies on an implicit assumption that sufficiently advanced models will produce broadly comparable outputs. This study revisits that assumption and finds limited support for it. Using a controlled simulation design, we compare three leading models (GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2) and observe systematic and statistically significant differences in how investment evaluations are formed. Each model evaluated 20 real startup pitch decks spanning multiple industries and funding stages. To account for stochastic variation in outputs, each model-deck pair was evaluated five times under identical conditions, which allows us to distinguish between one-off variation and more persistent behavioral tendencies. The results reveal consistent, reproducible differences across models in funding recommendations, evaluation scores, and expressed confidence. Reliability also varies substantially across models, with intraclass correlation coefficient (ICC) values ranging from 0.240 to 0.930. This suggests that model performance is not only about average behavior, but also about the stability of that behavior under repeated evaluation. Three behavioral profiles emerge. GPT-4o can be characterized as a cautious allocator, combining relatively favorable evaluations with conservative funding decisions. DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically underfunding. Claude 3.5 Sonnet aligns with a narrative funder profile, showing greater responsiveness to qualitative aspects of the pitch, somewhat higher funding levels, and strong cross-run reliability. These findings indicate that different models embed different evaluation logics, and these differences are large enough to shape outcomes in practice. Given the limited sample size, the results should be interpreted as exploratory. Even so, they point to the importance of incorporating reliability alongside average performance when assessing and deploying LLMs in high-stakes decision contexts.
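
To make the reliability measure concrete, the following is a minimal sketch of how an ICC could be computed for one model from a decks-by-runs table of scores. The abstract does not state which ICC variant the authors used; the one-way random-effects form, ICC(1,1), is assumed here purely for illustration. The function name `icc_oneway` and the sample scores are hypothetical and are not taken from the study.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for a table of repeated scores.

    `scores` is a list of rows, one row per pitch deck, each row holding
    the scores from repeated runs of the same model on that deck.
    Assumption: this ICC variant is illustrative; the source does not
    say which form the study used.
    """
    n = len(scores)                      # number of decks (targets)
    k = len(scores[0]) if n else 0       # number of repeated runs per deck
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 decks and 2 runs, with equal run counts")

    grand_mean = mean([x for row in scores for x in row])
    deck_means = [mean(row) for row in scores]

    # Between-deck mean square: how much decks differ from one another.
    ms_between = k * sum((m - grand_mean) ** 2 for m in deck_means) / (n - 1)
    # Within-deck mean square: how much repeated runs disagree on one deck.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(scores, deck_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("all scores are identical; ICC is undefined")
    return (ms_between - ms_within) / denominator


# Hypothetical scores for 4 decks x 5 runs (illustrative only, not study data).
example_scores = [
    [7.0, 7.5, 7.0, 6.5, 7.0],
    [4.0, 4.5, 4.0, 4.0, 3.5],
    [8.5, 8.0, 9.0, 8.5, 8.5],
    [5.5, 6.0, 5.0, 5.5, 6.0],
]
print(f"ICC(1,1) = {icc_oneway(example_scores):.3f}")
```

Under this form, values near 1 mean that repeated runs on the same deck agree closely relative to the spread between decks, while values near 0 mean that run-to-run noise is comparable to the differences between decks.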
