Abstract

Manual penetration testing is increasingly inadequate against rapidly evolving cyber threats, as it is time-consuming, difficult to scale, and heavily reliant on human expertise. Whilst Artificial Intelligence (AI), particularly Large Language Models (LLMs), offers automation potential, current tools suffer from poor reasoning, limited context retention, limited scalability, and safety and misuse concerns. This paper introduces AutoSecAgent, an AI-powered platform designed to address these limitations. Built on the adaptive Agent Zero orchestration framework and powered by a DeepSeek LLM tuned for the cybersecurity domain, AutoSecAgent supports semi-automated end-to-end penetration testing under human oversight, covering vulnerability discovery, exploitation, and reporting with remediation guidance. It uses recursive memory embedding to maintain context across complex multi-step attack chains, supporting strategic planning and situational awareness. Additionally, real-time Retrieval-Augmented Generation (RAG) allows AutoSecAgent to source up-to-date vulnerability data from repositories such as NVD and CVE, so that attack strategies incorporate current information. Its modular, agent-based architecture supports scalability across diverse network environments and integrates with tools such as Metasploit and Nmap. Rather than scattering claims across multiple design elements, this study focuses its core contributions on two tightly coupled points: (i) a memory-centric orchestration mechanism, Recursive Memory Embedding (RME), with an explicit operationalisation of context-loss errors for penetration testing trajectories; and (ii) a reproducible, tool-grounded, semi-automated workflow that combines RME with real-time retrieval and a bounded online optimisation loop under human oversight. Multi-agent decomposition and DeepSeek domain tuning are treated as supporting design choices that enable these two contributions rather than as separate innovations.
Experimental results indicate that this design improves end-to-end performance over representative LLM-based baselines across vulnerability detection, exploitation, multi-step attack completion, and scalability in controlled evaluation settings. The empirical study is positioned as a proof-of-concept validation in controlled and semi-realistic settings. Broader operational generalisation and comparison against additional non-LLM baseline classes (e.g. scanner-only pipelines) are treated as future work.
Al-Sabbagh et al. (Fri,) studied this question.