Large language models (LLMs) have demonstrated strong performance across a wide range of natural language processing tasks, but their effectiveness in specialized domains such as finance remains insufficiently understood. Financial language is characterized by domain-specific terminology, numerically grounded reasoning, context-sensitive interpretation, and high-stakes decision environments, all of which create additional challenges for general-purpose models. This study evaluates FinGPT, a domain-adapted financial LLM, across six core financial NLP tasks: sentiment analysis, text classification, named entity recognition, financial question answering, text summarization, and stock movement prediction. A comparative benchmark framework is employed to assess FinGPT against GPT-4, FinMA 7B, human performance where available, and selected task-specific baselines. The evaluation is conducted using established financial datasets and task-appropriate metrics, including accuracy, F1-score, exact match, and ROUGE. The results show that FinGPT performs strongly in structured classification tasks, particularly sentiment analysis and headline classification, where it achieves competitive and in some cases superior results relative to benchmark models. However, its performance declines substantially in tasks requiring deeper reasoning, numerical precision, long-context understanding, and coherent generation, especially in financial question answering and summarization. In stock movement prediction, FinGPT demonstrates moderate performance but shows directional sensitivity and stronger alignment with bullish than bearish market conditions. These findings indicate that domain adaptation improves performance in well-defined financial NLP tasks, but does not fully overcome limitations in reasoning-intensive and generation-heavy applications.
This study contributes a task-level benchmark and comparative analysis of FinGPT's capabilities and weaknesses, providing practical guidance for the development, evaluation, and deployment of domain-specific financial language models. A key limitation of this work is that the evaluation relies primarily on benchmark datasets and automatic metrics, with limited human-centered assessment for generative tasks. Practically, the findings suggest that FinGPT is promising for specialized, auditable financial NLP workflows, but remains unsuitable as a full replacement for more advanced general-purpose models in complex financial intelligence settings.
Djagba et al. (Mon,) studied this question.