What does this research mean for the field?

Retrieval-augmented few-shot prompting with relevance-based exemplars significantly improves the accuracy and downstream trading profitability of frozen large language models for financial sentiment classification on social media by mitigating linguistic drift. Novelty: methodological. Consensus alignment: neutral.

What question did this study set out to answer?

This work aims to improve financial sentiment classification on social media by adapting large language models to rapidly changing language and context.

June 6, 2026 | Open Access

Retrieval-augmented few-shot large language models (LLMs) for financial sentiment classification on social media

Key Points

The approach uses a frozen large language model with a lightweight external memory for inference-time calibration.
The exemplar memory is built from previously annotated financial tweets and supports relevance-based retrieval.
Cosine similarity over FinBERT-based embeddings retrieves the top-k exemplars (k = 5), which are combined with the query into a few-shot prompt for classification.
Relevance-based retrieval improved accuracy and Macro-F1 across all evaluated models and datasets relative to non-retrieval baselines.
On GPT-5, retrieval yielded an improvement of approximately 1 percentage point; on LLaMA-3, gains ranged from 10 to 14 percentage points, with comparable Macro-F1 increases.
In a 2015 trading backtest, the retrieval-augmented method yielded an annualized return of 19.24%, outperforming non-retrieval GPT-5 (14.33%), FinBERT (13.98%), and FinLLaMA (11.59%).
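The retrieval and prompt-assembly steps listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors stand in for FinBERT-based sentence embeddings, the example tweets are invented, and the names `cosine`, `retrieve_top_k`, `build_prompt`, and the prompt wording are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec, memory, k=5):
    """Return the k memory items most similar to the query, best first."""
    ranked = sorted(
        memory,
        key=lambda item: cosine(query_vec, item["vec"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_text, exemplars):
    """Assemble a few-shot prompt; the wording here is hypothetical."""
    lines = [
        "Classify the sentiment of each financial tweet as "
        "Positive, Neutral, or Negative.",
        "",
    ]
    for ex in exemplars:
        lines.append(f"Tweet: {ex['text']}")
        lines.append(f"Sentiment: {ex['label']}")
        lines.append("")
    lines.append(f"Tweet: {query_text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


# Toy exemplar memory: invented tweets with stand-in embedding vectors.
MEMORY = [
    {"text": "$AAPL to the moon", "label": "Positive", "vec": [0.9, 0.1, 0.0]},
    {"text": "$TSLA bagholders crying", "label": "Negative", "vec": [0.0, 0.2, 0.9]},
    {"text": "$MSFT earnings call at 5pm", "label": "Neutral", "vec": [0.1, 0.9, 0.1]},
]

query_vec = [0.8, 0.2, 0.1]  # stand-in embedding of the incoming tweet
top = retrieve_top_k(query_vec, MEMORY, k=2)
prompt = build_prompt("$NVDA ripping again", top)
print(prompt)
```

In a real pipeline the vectors would come from the sentence encoder and the prompt would be sent to the frozen LLM; the model weights are never touched, so refreshing the memory is the only maintenance step.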

Abstract

Financial sentiment classification on social media is challenging due to rapidly evolving language, including slang, sarcasm, and shifting market targets. Models that perform well on financial news often degrade on tweet-style text because of domain shift and temporal drift, resulting in unstable accuracy and frequent recalibration requirements. This work addresses sentiment analysis under such non-stationarity by calibrating a frozen large language model (LLM) at inference time using a lightweight external memory, enabling adaptation to contemporaneous market language without modifying model parameters. We construct an external exemplar memory composed of previously annotated financial tweets. For each incoming tweet, both the query and all memory items are encoded using a finance-specific sentence encoder based on Financial Bidirectional Encoder Representations from Transformers (FinBERT) embeddings. Cosine similarity is then used to retrieve the top-k most relevant labeled exemplars (k = 5). These exemplars are combined with the query into a standardized few-shot prompt that asks the frozen LLM to classify sentiment as Positive, Neutral, or Negative. In this design, retrieval acts as a non-parametric inference-time prior, locally constraining the model’s decision space toward contemporaneous market semantics while avoiding permanent parameter updates.
We evaluate zero-shot prompting, few-shot prompting with random exemplars, and relevance-based retrieval-augmented prompting across multiple tweet datasets, including standard benchmarks and a more recent 2024 out-of-distribution tweet set collected after the reported training cutoff of Large Language Model Meta AI 3 (LLaMA-3). Downstream utility is evaluated by converting daily sentiment predictions into next-day long-short trading signals. Across all models and datasets, relevance-based retrieval consistently improves accuracy and macro-averaged F1 score (Macro-F1) relative to non-retrieval baselines. On a strong Generative Pre-trained Transformer 5 (GPT-5) configuration, retrieval yields a modest but statistically significant improvement of approximately 1 percentage point, while on LLaMA-3 the gains are substantially larger, ranging from 10 to 14 percentage points with comparable Macro-F1 increases. In a 2015 trading backtest, retrieval-augmented prompting achieves an annualized return of 19.24%, outperforming non-retrieval GPT-5 (14.33%) as well as supervised domain baselines such as FinBERT (13.98%) and FinLLaMA (11.59%), while the Standard & Poor’s 500 (S&P 500) index records −0.73% over the same period. These results demonstrate that inference-time exemplar retrieval can provide effective calibration under linguistic drift, matching or approaching fine-tuned systems while reducing the need for continual retraining and model upkeep.
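The two evaluation quantities named in the abstract, Macro-F1 and the annualized return of a next-day long-short strategy, can be sketched as follows. The Macro-F1 function follows the standard definition (unweighted mean of per-class F1). The trading part is only an assumption-laden illustration: the abstract does not state how daily predictions are aggregated, so the net-vote rule in `daily_signal`, the single-asset setup, the 252-day year, and the absence of transaction costs are all hypothetical choices.

```python
def macro_f1(y_true, y_pred, labels=("Positive", "Neutral", "Negative")):
    """Unweighted mean of per-class F1 scores (standard Macro-F1)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def daily_signal(day_labels):
    """Hypothetical rule: net vote of the day's predicted labels.

    Returns +1 (long), -1 (short), or 0 (flat).
    """
    votes = {"Positive": 1, "Negative": -1}
    score = sum(votes.get(label, 0) for label in day_labels)
    return (score > 0) - (score < 0)


def long_short_returns(signals, asset_returns):
    """Apply the signal from day t to the asset's return on day t+1."""
    return [s * r for s, r in zip(signals[:-1], asset_returns[1:])]


def annualized_return(daily_returns, periods_per_year=252):
    """Geometric annualization of simple per-period returns."""
    if not daily_returns:
        return 0.0
    growth = 1.0
    for r in daily_returns:
        growth *= 1.0 + r
    return growth ** (periods_per_year / len(daily_returns)) - 1.0


# Toy usage with invented labels and returns.
y_true = ["Positive", "Positive", "Negative", "Neutral"]
y_pred = ["Positive", "Negative", "Negative", "Neutral"]
print(round(macro_f1(y_true, y_pred), 4))

signals = [daily_signal(day) for day in (["Positive", "Positive"], ["Negative"], ["Neutral"])]
strategy = long_short_returns(signals, [0.0, 0.01, -0.02])
print(signals, strategy)
```

Shifting the signal by one day keeps the sketch free of look-ahead: a position is only taken after the sentiment for the previous day is known.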
