What question did this study set out to answer?

This research aims to evaluate the efficacy of Large Language Models (LLMs) in performing automated Linux privilege-escalation attacks.

February 12, 2026 · Open Access

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

Key Points

This research aims to evaluate the efficacy of Large Language Models (LLMs) in performing automated Linux privilege-escalation attacks.
Introduced a novel, automated LLM-driven tool named hackingBuddyGPT for testing privilege escalation.
Curated a benchmark with single-vulnerability virtual machines for controlled evaluation.
Conducted empirical analysis comparing LLMs (GPT-3.5-Turbo, GPT-4-Turbo, Llama3) against human testers and traditional tools.
Investigated the effects of context management and guidance strategies on LLM performance.
GPT-4-Turbo achieved a success rate of 33–83%, comparable to human testers at 75%.
Llama3 showed limited success (0–33%), while GPT-3.5-Turbo had moderate rates (16–50%).
High-level guidance substantially improved GPT-4-Turbo's success rates.
Cost analysis indicated that GPT-4-Turbo can reach human-comparable performance at a competitive cost per exploited vulnerability.

Abstract

Penetration-testing is crucial for identifying and mitigating system vulnerabilities, and privilege-escalation, the act of gaining elevated access to protected resources, is one of its critical subtasks. The emergence of Large Language Models (LLMs) presents new avenues for automating these security practices by emulating human behavior. However, the efficacy and limitations of LLMs in performing autonomous Linux privilege-escalation attacks remain underexplored. To address this gap, we introduce hackingBuddyGPT, a fully automated LLM-driven prototype designed for evaluating autonomous Linux privilege-escalation. We curated a novel, publicly available Linux privilege-escalation benchmark comprising distinct, single-vulnerability virtual machines, enabling controlled and reproducible evaluation. Our empirical analysis assesses the quantitative success rates and qualitative operational behaviors of various LLMs (GPT-3.5-Turbo, GPT-4-Turbo, and Llama3) against baselines of human professional penetration-testers and traditional automated tools. We investigate the impact of context management strategies, different context sizes, and various high-level guidance mechanisms on LLM performance. Results show that GPT-4-Turbo demonstrates high efficacy, successfully exploiting 33–83% of vulnerabilities, a performance comparable to human penetration-testers (75%). In contrast, local models like Llama3 exhibited limited success (0–33%), and GPT-3.5-Turbo achieved moderate rates (16–50%). High-level guidance significantly boosts LLM success rates: for GPT-4-Turbo, it raised the success rate from 33% (without guidance) to 66%, or from 66% to 83%. State management through LLM-driven reflection doubled the unaided GPT-4-Turbo success rate (from 33% to 66%).
Qualitative analysis reveals both strengths and weaknesses of LLMs in generating valid commands and highlights challenges in common-sense reasoning, error handling, and multi-step exploitation, particularly with temporal dependencies. Cost analysis indicates that GPT-4-Turbo can achieve human-comparable performance at competitive costs per exploited vulnerability, especially with optimized context management. Our work provides a baseline for evaluating LLM capabilities in autonomous privilege escalation, guiding future research toward more effective and reliable LLM-guided penetration-testing.
