Financial sentiment classification on social media is challenging due to rapidly evolving language, including slang, sarcasm, and shifting market targets. Models that perform well on financial news often degrade on tweet-style text because of domain shift and temporal drift, resulting in unstable accuracy and a frequent need for recalibration. This work addresses sentiment analysis under such non-stationarity by calibrating a frozen large language model (LLM) at inference time using a lightweight external memory, enabling adaptation to contemporaneous market language without modifying model parameters. We construct an external exemplar memory composed of previously annotated financial tweets. For each incoming tweet, both the query and all memory items are encoded using a finance-specific sentence encoder based on Financial Bidirectional Encoder Representations from Transformers (FinBERT) embeddings. Cosine similarity is then used to retrieve the top-k most relevant labeled exemplars (k = 5). These exemplars are combined with the query into a standardized few-shot prompt that asks the frozen LLM to classify sentiment as Positive, Neutral, or Negative. In this design, retrieval acts as a non-parametric inference-time prior, locally constraining the model’s decision space toward contemporaneous market semantics while avoiding permanent parameter updates. We evaluate zero-shot prompting, few-shot prompting with random exemplars, and relevance-based retrieval-augmented prompting across multiple tweet datasets, including standard benchmarks and a more recent 2024 out-of-distribution tweet set collected after the reported training cutoff of Large Language Model Meta AI 3 (LLaMA-3). Downstream utility is evaluated by converting daily sentiment predictions into next-day long-short trading signals. Across all models and datasets, relevance-based retrieval consistently improves accuracy and macro-averaged F1 score (Macro-F1) relative to non-retrieval baselines.
On a strong Generative Pre-trained Transformer 5 (GPT-5) configuration, retrieval yields a modest but statistically significant improvement of approximately 1 percentage point, while on LLaMA-3 the gains are substantially larger, ranging from 10 to 14 percentage points with comparable Macro-F1 increases. In a 2015 trading backtest, retrieval-augmented prompting achieves an annualized return of 19.24%, outperforming non-retrieval GPT-5 (14.33%) as well as supervised domain baselines such as FinBERT (13.98%) and FinLLaMA (11.59%), while the Standard & Poor’s 500 (S&P 500) index records −0.73% over the same period. These results demonstrate that inference-time exemplar retrieval can provide effective calibration under linguistic drift, matching or approaching fine-tuned systems while reducing the need for continual retraining and model upkeep.
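The retrieval-and-prompt step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes embeddings have already been computed by some sentence encoder, and the names `cosine`, `retrieve_top_k`, `build_prompt`, and the prompt wording are hypothetical choices made for this example.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical memory item: (tweet text, sentiment label, precomputed embedding).
Exemplar = Tuple[str, str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query_vec: Sequence[float],
                   memory: List[Exemplar],
                   k: int = 5) -> List[Exemplar]:
    """Return the k memory items most similar to the query, best first."""
    ranked = sorted(memory, key=lambda ex: cosine(query_vec, ex[2]), reverse=True)
    return ranked[:k]


def build_prompt(query_text: str, exemplars: List[Exemplar]) -> str:
    """Assemble a few-shot prompt; the exact wording here is an assumption."""
    lines = [
        "Classify the sentiment of each financial tweet as "
        "Positive, Neutral, or Negative.",
        "",
    ]
    for text, label, _ in exemplars:
        lines += [f"Tweet: {text}", f"Sentiment: {label}", ""]
    lines += [f"Tweet: {query_text}", "Sentiment:"]
    return "\n".join(lines)
```

In use, the query tweet would be embedded with the same encoder as the memory, passed to `retrieve_top_k` with k = 5, and the resulting prompt sent to the frozen LLM. Because only the memory contents change over time, no model parameters are touched.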
凌继贤 et al. studied this question.