What question did this study set out to answer?

The study set out to develop a new framework for detecting code smells using agentic retrieval-augmented generation, with the aim of improving adaptability and explainability.

March 6, 2026 | Open Access

Pioneering agentic retrieval-augmented generation in software quality: a novel framework for code smell detection via dynamic retrieval

Key Points

Aimed to develop a code smell detection framework that uses agentic retrieval-augmented generation to improve adaptability and explainability.
Introduced an Agentic RAG framework that integrates autonomous agents into retrieval and reasoning.
Utilized hybrid retrieval strategies (sparse and dense) for improved context awareness.
Employed large language models for generating explainable outputs and contextual feedback.
Conducted experiments on multiple programming languages for generalization and scalability.
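Achieved 89.5% accuracy, a macro F1-score of 78.3%, and a weighted F1-score of 88.7% when detecting Long Method and Large Class smells with DeepSeek and chain-of-thought prompting.
Extended the framework in Experiment 2 to 21 code smell types across multiple programming languages, reaching 94.85% accuracy, a macro F1-score of 90.24%, and a weighted F1-score of 94.93%.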
Demonstrated robust performance and scalability through stratified five-fold cross-validation.
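
The macro F1-score quoted above averages per-class F1 with equal weight, whereas the weighted F1-score reported in the abstract weights each class by its number of true samples. A macro score that sits below the weighted one often means the rarer classes are detected less reliably than the common ones. The following is a minimal sketch of both averages in plain Python; the labels and predictions are invented toy data, not results from the study.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by true-label support."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy, imbalanced data (hypothetical): 8 clean snippets, 2 Long Method snippets.
Y_TRUE = ["clean"] * 8 + ["long_method"] * 2
Y_PRED = ["clean"] * 8 + ["long_method", "clean"]

print(f"macro F1:    {macro_f1(Y_TRUE, Y_PRED):.4f}")
print(f"weighted F1: {weighted_f1(Y_TRUE, Y_PRED):.4f}")
```

On this toy data a single missed minority-class sample lowers the macro average more than the weighted one, which is the pattern to look for when comparing the two figures.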

Abstract

Code smells—subtle indicators of poor design choices—pose significant challenges to software maintainability and readability, particularly in dynamic languages such as Python. Traditional detection methods, including rule-based heuristics and static machine learning classifiers, often suffer from limited adaptability, poor contextual awareness, and lack of explainability. These limitations hinder their effectiveness in evolving codebases and real-world development environments. This study introduces a novel Agentic retrieval-augmented generation (Agentic RAG) framework for code smell detection, marking the first application of agentic reasoning in this domain. By embedding autonomous agents into the retrieval and reasoning pipeline, the proposed system dynamically routes queries, selects optimal retrieval strategies, and synthesizes context-aware explanations using large language models (LLMs). Unlike static classifiers, the proposed framework leverages hybrid retrieval (sparse + dense) and structured prompting to detect and explain Long Method and Large Class smells with high interpretability. Experimental results demonstrate that Agentic RAG—particularly when paired with DeepSeek and chain-of-thought prompting—achieves superior performance, with 89.5% accuracy, a macro F1-score of 78.3%, and a weighted F1 of 88.7%. To assess generalization, Experiment 2 extended the framework to 21 distinct code smell types across multiple programming languages, achieving 94.85% accuracy, a macro F1-score of 90.24%, and a weighted F1-score of 94.93% through stratified five-fold cross-validation, thereby confirming the model’s robustness and scalability. Beyond academic benchmarks, this work lays the foundation for real-world integration into developer platforms, enabling real-time code review, contextual feedback, and actionable refactoring suggestions. 
By bridging LLMs with dynamic retrieval and agentic reasoning, this framework advances the frontier of intelligent software quality assurance.
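
The abstract does not specify how the agents route queries or how sparse and dense scores are combined, so the following is only an illustrative sketch of the general idea of routed hybrid retrieval. Everything in it is an assumption: the `route` heuristic stands in for an agent's decision, `embed` is a toy hashed-trigram vector standing in for a learned embedding model, and the weights and example snippets are invented.

```python
import math
import re
import zlib

DIM = 256  # size of the toy embedding space (arbitrary choice)


def tokenize(text):
    """Lowercase word/identifier tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def sparse_score(query, doc):
    """Sparse (lexical) score: fraction of query tokens found in the document."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0


def embed(text):
    """Toy 'dense' vector from hashed character trigrams, L2-normalised.
    A stand-in for a learned embedding model, not a real one."""
    vec = [0.0] * DIM
    s = text.lower()
    for i in range(len(s) - 2):
        vec[zlib.crc32(s[i:i + 3].encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def dense_score(query, doc):
    """Cosine similarity between the toy embeddings."""
    return sum(a * b for a, b in zip(embed(query), embed(doc)))


def route(query):
    """Hypothetical router: return the weight given to the sparse score.
    Queries naming an identifier lean on exact lexical matching."""
    has_identifier = "_" in query or bool(re.search(r"[a-z][A-Z]", query))
    return 0.8 if has_identifier else 0.3


def hybrid_search(query, docs, top_k=2):
    """Rank docs by a routed blend of sparse and dense scores."""
    alpha = route(query)
    scored = [
        (alpha * sparse_score(query, d) + (1 - alpha) * dense_score(query, d), d)
        for d in docs
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Invented knowledge-base entries describing code snippets.
DOCS = [
    "def process_order(): 120 lines, long method with nested loops",
    "class ReportManager: 40 fields, large class",
    "def add(a, b): return a + b",
]

for score, doc in hybrid_search("process_order long method", DOCS):
    print(f"{score:.3f}  {doc}")
```

In a full system, the retrieved entries would be placed into a structured prompt for the LLM, which then produces the smell label and its explanation; that step is omitted here because the source gives no detail on it.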
