This thesis investigates the comparative effectiveness of two large language model (LLM) deployment paradigms—zero-shot inference via commercial APIs and supervised fine-tuning of transformer-based models—for stock return prediction using financial news. Leveraging a novel dataset of over 30,000 Dow Jones Newswire (DJN) articles aligned with firm-level returns from 1989 to 2020, the study systematically evaluates OpenAI's GPT-4 in a zero-shot setting against fine-tuned BERT-based models trained on identical text–return pairs. The analysis employs a rolling-window cross-validation framework to ensure robustness across market regimes and emphasizes not only predictive accuracy but also operational feasibility, cost efficiency, and scalability. Results demonstrate that both modeling approaches produce statistically significant predictive power above random baselines, yet with distinct trade-offs. GPT-4 achieves 53.38% directional accuracy and a Sharpe ratio of 1.076, highlighting its ability to extract meaningful signals without domain-specific tuning. Its operational simplicity, low setup cost, and per-call flexibility make it particularly attractive for rapid deployment and lightweight financial applications. By contrast, the supervised BERT model delivers markedly stronger financial performance, attaining an annualized Sharpe ratio of 4.08 and mean daily returns of 12.81 basis points. These superior outcomes underscore the value of domain-specific adaptation, but they come at the cost of substantial GPU resources, specialized expertise, and ongoing maintenance requirements. Temporal analysis further reveals that both approaches maintain predictive robustness across diverse economic conditions—including the dot-com era, the 2008 financial crisis, and post-crisis market environments—though performance metrics vary with shifts in efficiency and volatility. 
Importantly, GPT-4 provides consistent positive alphas despite its modest accuracy, while fine-tuned BERT models extract higher-magnitude signals during volatile periods, reinforcing the economic relevance of tailored architectures. The findings contribute to the growing literature at the intersection of financial economics and natural language processing by offering the first unified empirical framework to compare API-driven and locally fine-tuned LLMs for financial prediction. They highlight a fundamental trade-off in institutional strategy: API-based models democratize access to advanced language capabilities with minimal infrastructure, while fine-tuned models yield superior risk-adjusted returns for institutions capable of sustaining the computational investment. This research advances both theory and practice by demonstrating that textual data continues to contain exploitable information for asset pricing and that the operational context—budget, infrastructure, and regulatory considerations—should guide the choice of modeling paradigm. Ultimately, the results suggest that financial institutions must navigate between accessibility and performance, as the integration of LLMs into quantitative trading evolves from experimental adoption toward strategic deployment.
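The headline metrics quoted above (directional accuracy, annualized Sharpe ratio, and mean daily return in basis points) can be illustrated with a minimal sketch. This is not the thesis's own evaluation code. The function names, the 252-trading-day annualization convention, the square-root-of-time scaling, and the use of returns already in excess of the risk-free rate are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev

# Assumed convention: 252 trading days per year (not stated in the abstract).
TRADING_DAYS_PER_YEAR = 252


def directional_accuracy(predicted_signals, realized_returns):
    """Fraction of observations where the predicted direction matches the realized one.

    A signal or return greater than zero counts as "up"; zero or negative
    counts as "down" (an illustrative tie-breaking choice).
    """
    if len(predicted_signals) != len(realized_returns):
        raise ValueError("inputs must have the same length")
    if not realized_returns:
        raise ValueError("inputs must be non-empty")
    hits = sum(
        1
        for signal, ret in zip(predicted_signals, realized_returns)
        if (signal > 0) == (ret > 0)
    )
    return hits / len(realized_returns)


def annualized_sharpe(daily_returns, periods_per_year=TRADING_DAYS_PER_YEAR):
    """Mean daily return over sample standard deviation, scaled by sqrt(periods).

    Assumes daily_returns are already excess returns (risk-free rate removed).
    """
    return mean(daily_returns) / stdev(daily_returns) * math.sqrt(periods_per_year)


def mean_return_bps(daily_returns):
    """Mean daily return in basis points (1 bp = 0.01% = 0.0001)."""
    return mean(daily_returns) * 10_000


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not results from the thesis.
    signals = [1, -1, 1, -1]
    returns = [0.01, -0.02, -0.01, 0.03]
    print(directional_accuracy(signals, returns))
    print(annualized_sharpe([0.01, -0.01, 0.02, 0.0]))
    print(mean_return_bps([0.01, -0.01, 0.02, 0.0]))
```

Under these assumptions, a directional accuracy only slightly above 50% can still coexist with a meaningful Sharpe ratio, because the Sharpe ratio depends on the size and variability of the strategy's returns rather than on the hit rate alone.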
Rong Bai studied this question.