The rapidly growing use of large language models (LLMs) in high-stakes settings, such as venture capital screening, often rests on an implicit assumption that sufficiently advanced models will produce broadly comparable outputs. This study revisits that assumption and finds limited support for it. In a controlled simulation, three leading models (GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V2) each evaluated 20 real startup pitch decks spanning multiple industries and funding stages. To account for stochastic variation in outputs, each evaluation was repeated five times under identical conditions, which allows us to distinguish one-off variation from more persistent behavioral tendencies. The results reveal systematic, statistically significant, and reproducible differences across models in funding recommendations, evaluation scores, and expressed confidence. Reliability also varies substantially across models, with intraclass correlation coefficient (ICC) values ranging from 0.240 to 0.930. Model performance is therefore a matter not only of average behavior but also of how stable that behavior is under repeated evaluation. Three behavioral profiles emerge. GPT-4o can be characterized as a cautious allocator, combining relatively favorable evaluations with conservative funding decisions. DeepSeek-V2 appears as a conservative scorer, applying more stringent and highly consistent evaluations while systematically underfunding. Claude 3.5 Sonnet aligns with a narrative funder profile, showing greater responsiveness to qualitative aspects of the pitch, somewhat higher funding levels, and strong cross-run reliability. These findings indicate that different models embed different evaluation logics, and that the differences are large enough to shape outcomes in practice. Given the limited sample size, the results should be interpreted as exploratory. Even so, they point to the importance of assessing reliability alongside average performance when evaluating and deploying LLMs in high-stakes decision contexts.
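To make the reliability measure concrete, the following is a minimal sketch of how an ICC could be computed from repeated runs. The abstract does not state which ICC form was used, so the one-way random-effects ICC(1,1) is an assumption here; the function name `icc_oneway` and the example scores are hypothetical and are not data from the study.

```python
from statistics import mean


def icc_oneway(scores):
    """One-way random-effects ICC(1,1) for scores[deck][run].

    Assumed form (not confirmed by the source):
        ICC = (MSB - MSW) / (MSB + (k - 1) * MSW)
    where MSB is the between-deck mean square, MSW the within-deck
    (across-run) mean square, and k the number of runs per deck.
    """
    n = len(scores)                      # number of decks
    k = len(scores[0]) if n else 0       # number of repeated runs
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 decks and the same >= 2 runs per deck")

    grand_mean = mean(x for row in scores for x in row)
    deck_means = [mean(row) for row in scores]

    # Sums of squares from a one-way ANOVA with decks as groups.
    ss_between = k * sum((m - grand_mean) ** 2 for m in deck_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(scores, deck_means) for x in row
    )

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_between - ms_within) / denom


# Hypothetical scores for 3 decks x 3 runs (illustrative only).
stable_model = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]   # runs agree exactly
erratic_model = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]  # runs disagree, decks do not differ

print(icc_oneway(stable_model))
print(icc_oneway(erratic_model))
```

Under this form, a model whose repeated runs agree on every deck yields an ICC near 1, while a model whose run-to-run noise dominates the differences between decks yields an ICC near or below 0. In the study's design, each row would hold the five repeated evaluations of one pitch deck by a single model.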
Buranasomphop et al. (Mon,) studied this question.