What question did this study set out to answer?

This research aims to compare the effectiveness of zero-shot API-driven LLMs and supervised fine-tuned models in predicting stock returns using financial news.

June 20, 2026 · Open Access

News to Numbers: A Comparative Analysis of Traditional and API-driven LLMs in Stock Return Prediction

Key Points

Leveraged a dataset of over 30,000 Dow Jones Newswire articles from 1989 to 2020
Evaluated OpenAI's GPT-4 against fine-tuned BERT models using a rolling-window cross-validation framework
Emphasized predictive accuracy, operational feasibility, cost efficiency, and scalability
GPT-4 achieved 53.38% directional accuracy and a Sharpe ratio of 1.076, providing consistent positive alphas
The fine-tuned BERT model had an annualized Sharpe ratio of 4.08 with mean daily returns of 12.81 basis points
Both models showed predictive robustness across different economic conditions, with varying performance metrics during market volatility
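The two headline metrics above, directional accuracy and the annualized Sharpe ratio, can be illustrated with a minimal sketch. The source does not give the exact formulas, so this sketch makes three assumptions: 252 trading days per year, a zero daily risk-free rate by default, and a sign-match definition of directional accuracy. The function names are hypothetical.

```python
import math
from statistics import mean, stdev

# Assumption: 252 trading days per year, a common convention not stated in the source.
TRADING_DAYS_PER_YEAR = 252


def directional_accuracy(predicted, realized):
    """Share of observations where predicted and realized returns share a sign.

    Assumption: a return counts as "up" only if it is strictly positive.
    """
    if len(predicted) != len(realized) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for p, r in zip(predicted, realized) if (p > 0) == (r > 0))
    return hits / len(predicted)


def annualized_sharpe(daily_returns, daily_risk_free=0.0):
    """Mean daily excess return over its sample standard deviation, scaled by sqrt(252)."""
    excess = [r - daily_risk_free for r in daily_returns]
    return mean(excess) / stdev(excess) * math.sqrt(TRADING_DAYS_PER_YEAR)
```

Under these conventions, a high Sharpe ratio with a modest mean daily return implies low daily volatility, which is why the sketch reports the two quantities through separate functions.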

Abstract

This thesis investigates the comparative effectiveness of two large language model (LLM) deployment paradigms—zero-shot inference via commercial APIs and supervised fine-tuning of transformer-based models—for stock return prediction using financial news. Leveraging a novel dataset of over 30,000 Dow Jones Newswire (DJN) articles aligned with firm-level returns from 1989 to 2020, the study systematically evaluates OpenAI's GPT-4 in a zero-shot setting against fine-tuned BERT-based models trained on identical text–return pairs. The analysis employs a rolling-window cross-validation framework to ensure robustness across market regimes and emphasizes not only predictive accuracy but also operational feasibility, cost efficiency, and scalability. Results demonstrate that both modeling approaches produce statistically significant predictive power above random baselines, yet with distinct trade-offs. GPT-4 achieves 53.38% directional accuracy and a Sharpe ratio of 1.076, highlighting its ability to extract meaningful signals without domain-specific tuning. Its operational simplicity, low setup cost, and per-call flexibility make it particularly attractive for rapid deployment and lightweight financial applications. By contrast, the supervised BERT model delivers markedly stronger financial performance, attaining an annualized Sharpe ratio of 4.08 and mean daily returns of 12.81 basis points. These superior outcomes underscore the value of domain-specific adaptation, but they come at the cost of substantial GPU resources, specialized expertise, and ongoing maintenance requirements. Temporal analysis further reveals that both approaches maintain predictive robustness across diverse economic conditions—including the dot-com era, the 2008 financial crisis, and post-crisis market environments—though performance metrics vary with shifts in efficiency and volatility. 
Importantly, GPT-4 provides consistent positive alphas despite its modest accuracy, while fine-tuned BERT models extract higher-magnitude signals during volatile periods, reinforcing the economic relevance of tailored architectures. The findings contribute to the growing literature at the intersection of financial economics and natural language processing by offering the first unified empirical framework to compare API-driven and locally fine-tuned LLMs for financial prediction. They highlight a fundamental trade-off in institutional strategy: API-based models democratize access to advanced language capabilities with minimal infrastructure, while fine-tuned models yield superior risk-adjusted returns for institutions capable of sustaining the computational investment. This research advances both theory and practice by demonstrating that textual data continues to contain exploitable information for asset pricing and that the operational context—budget, infrastructure, and regulatory considerations—should guide the choice of modeling paradigm. Ultimately, the results suggest that financial institutions must navigate between accessibility and performance, as the integration of LLMs into quantitative trading evolves from experimental adoption toward strategic deployment.
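The rolling-window cross-validation mentioned in the abstract can be sketched as follows. The abstract does not specify window lengths, step size, or whether the training window is fixed or expanding, so the fixed-length window and every parameter name here are assumptions for illustration only.

```python
def rolling_window_splits(n_obs, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) over chronologically ordered observations.

    Assumption: a fixed-length training window that slides forward by `step`
    observations (defaulting to `test_size`), so test windows do not overlap.
    """
    if step is None:
        step = test_size
    if min(n_obs, train_size, test_size, step) <= 0:
        raise ValueError("all sizes must be positive")
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        # Every training index precedes every test index, so the model
        # never sees information from the period it is evaluated on.
        yield train, test
        start += step


# Example: 10 ordered observations, train on 4, test on the next 2.
for train, test in rolling_window_splits(n_obs=10, train_size=4, test_size=2):
    print(list(train), list(test))
```

Because each test window lies strictly after its training window, the split respects time ordering, which matters when the observations are dated news articles paired with subsequent returns.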
