What question did this study set out to answer?

The aim is to assess how large language models perform in generating stock market decisions based solely on historical price data.

April 19, 2026 | Open Access

Evaluating the Efficacy of Large Language Models in Stock Market Decision-Making: A Decision-Focused, Price-Only, Multi-Country Analysis Using Historical Price Data

Key Points

Evaluated three LLMs: GPT-4.0, Gemini 2.0 Flash, and LLaMA-4-Scout-17B-16E.
Used historical closing-price data of 150 stocks from three countries across ten sectors.
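Inputs were limited to three years of numerical closing-price time series, with no supplementary financial information.
Asked the LLMs three types of finance-related questions: whether to buy a stock, whether to sell or hold a stock, and, in a pairwise comparison, which stock to buy or hold.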
Performance analyzed over two investment horizons: 1 month and 3 months.
GPT-4.0 achieved the highest average accuracy at 56%.
LLaMA-4-Scout-17B-16E followed with 48% accuracy.
Gemini 2.0 Flash had the lowest accuracy at 39%.
Overall performance was moderate and varied with market conditions, with relatively higher accuracy (51%) in high-volatility regimes.
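
The accuracy figures above come from comparing each model decision with the realized market outcome over the test horizon. The summary does not state the exact scoring rule, so the following is only a minimal sketch under an assumed rule: a "buy" answer counts as correct when the closing price at the end of the horizon is higher than at the decision date, and any other answer counts as correct when it is not. The function names and the toy records are hypothetical and are not taken from the study.

```python
from typing import Iterable, Tuple


def buy_decision_correct(decision: str, start_close: float, end_close: float) -> bool:
    """Score one decision under the ASSUMED rule described in the lead-in.

    'buy' is correct if the close rose over the horizon; any other answer
    (for example 'do not buy') is correct if it did not rise.
    """
    rose = end_close > start_close
    return (decision.strip().lower() == "buy") == rose


def accuracy(records: Iterable[Tuple[str, float, float]]) -> float:
    """Fraction of (decision, start_close, end_close) records scored correct."""
    results = [buy_decision_correct(d, s, e) for d, s, e in records]
    if not results:
        raise ValueError("no records to score")
    return sum(results) / len(results)


# Hypothetical toy data, not results from the study.
sample = [
    ("buy", 100.0, 110.0),       # price rose, model said buy -> correct
    ("buy", 50.0, 45.0),         # price fell, model said buy -> wrong
    ("do not buy", 20.0, 18.0),  # price fell, model declined -> correct
    ("do not buy", 30.0, 33.0),  # price rose, model declined -> wrong
]
print(accuracy(sample))  # 0.5
```

The same function could be applied separately to the 1-month and 3-month horizons by changing which closing price is passed as `end_close`.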

Abstract

This study provides a comparative evaluation of three state-of-the-art large language models (LLMs), namely OpenAI’s (San Francisco, CA, USA) GPT-4.0, Google’s (Google LLC, Mountain View, CA, USA) Gemini 2.0 Flash, and Meta’s (Meta Platforms, Menlo Park, CA, USA) LLaMA-4-Scout-17B-16E, in a decision-oriented framework in which the models generate structured outputs based only on historical closing-price data. The evaluation covers 150 stocks sampled from three countries (India, the United States, and South Africa) across ten economic sectors, including Information Technology, Banking, and Pharmaceuticals. Unlike many prior studies that combine numerical and textual inputs, this study relies solely on three years of numerical time series data and examines model responses in terms of decision labels such as buy, sell, or hold. The LLMs were provided with historical closing-price sequences and prompted with three types of finance-related questions: (a) whether to buy a stock, (b) whether to sell or hold a stock, and (c) in a pairwise comparison, which stock to buy or hold. These prompts were evaluated across two investment horizons: 1 month and 3 months. Model outputs were compared against realized market outcomes during the corresponding test periods. Performance was assessed across four key dimensions: country, sector, annualized volatility, and question type. The models were not given any supplementary financial information or instructions on specific analytical methods. The results indicate that GPT-4.0 achieves the highest average accuracy (56%), followed by LLaMA-4-Scout-17B-16E (48%) and Gemini 2.0 Flash (39%). Overall performance remains moderate and varies across market conditions, with relatively higher accuracy observed in high-volatility regimes (51%).
This work evaluates how LLMs behave when presented with structured numerical price sequences in a controlled decision-labeling setting and contributes to the broader discussion on the potential and limitations of LLMs for numerical decision tasks in finance.
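
Annualized volatility is one of the four dimensions along which performance is reported, but the abstract does not say how it was computed. A common convention, used here purely as an assumption, is the sample standard deviation of daily log returns scaled by the square root of 252 trading days. The function name and the example prices below are hypothetical.

```python
import math
from statistics import stdev
from typing import Sequence

# Assumption: 252 trading days per year; the study may use a different convention.
TRADING_DAYS_PER_YEAR = 252


def annualized_volatility(closes: Sequence[float]) -> float:
    """Annualized volatility from a sequence of daily closing prices.

    Uses daily log returns and the sample standard deviation, which needs
    at least two returns (three closing prices).
    """
    if len(closes) < 3:
        raise ValueError("need at least three closing prices")
    if any(price <= 0 for price in closes):
        raise ValueError("closing prices must be positive")
    log_returns = [math.log(later / earlier) for earlier, later in zip(closes, closes[1:])]
    return stdev(log_returns) * math.sqrt(TRADING_DAYS_PER_YEAR)


# Hypothetical price paths, not data from the study.
calm = [100.0, 100.5, 100.2, 100.6, 100.4]
choppy = [100.0, 106.0, 98.0, 105.0, 97.0]
print(annualized_volatility(calm) < annualized_volatility(choppy))  # True
```

Stocks could then be grouped into volatility regimes by thresholding this value; any such cut-off would be a further assumption, since the summary does not give one.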
