What question did this study set out to answer?

This study examines the effectiveness of large language models (LLMs) in dynamic application security testing for Android apps.

May 16, 2026 | Open Access

LLM-Guided Dynamic Security Testing of Android Applications: A Comparative Study Across Selected Models

Key Points

This study examines the effectiveness of large language models (LLMs) in dynamic application security testing for Android apps.
The study compared five LLM-based systems: Gemini 2.5 Flash, GPT-oss 120B, Llama 3.3 70B, Qwen 3 32B, and Trinity Large Preview.
The models were evaluated on intentionally vulnerable Android applications using weighted step-level scoring.
The main guidance experiments drew on aggregated data from multiple runs.
Gemini achieved the highest combined Fully Discovered and Partially Discovered (FD + PD) detection rate, 79.1%, in the representative run.
Repeated runs for selected models showed limited aggregate variability.
More detailed prompts improved the guidance quality provided by LLMs.

Abstract

The rapid growth of publicly available digital services increases the need for scalable security assessment. This is particularly important for software directly used by end users, such as Android applications. Due to staff shortages and financial constraints, small and medium-sized enterprises are often unable to test their applications for vulnerabilities. One possible support mechanism is the use of large language models (LLMs) to assist testers during such assessments. The aim of this study was to investigate the feasibility of using an LLM as an interactive guide for dynamic application security testing (DAST) of Android applications. For this purpose, five LLM-based systems were compared: Gemini 2.5 Flash, GPT-oss 120B, Llama 3.3 70B, Qwen 3 32B, and Trinity Large Preview accessed via OpenRouter. The models were evaluated on intentionally vulnerable Android applications using weighted step-level scoring and three selected exploit guidance scenarios. In the main guidance experiment, Gemini achieved the highest combined Fully Discovered and Partially Discovered (FD + PD) detection rate of 79.1% in the representative run, while repeated runs for selected models showed limited aggregate variability. The results also indicate that more detailed prompts improve the quality of generated guidance. The findings suggest that LLMs can serve as interactive guides for DAST of Android applications, although they should be treated as supporting tools rather than standalone security-testing systems.
