The rapid growth of publicly available digital services increases the need for scalable security assessment. This is particularly important for software used directly by end users, such as Android applications. Because of staff shortages and financial constraints, small and medium-sized enterprises are often unable to test their applications for vulnerabilities. One possible support mechanism is the use of large language models (LLMs) to assist testers during such assessments. The aim of this study was to investigate whether an LLM can be used as an interactive guide for dynamic application security testing (DAST) of Android applications. For this purpose, five LLM-based systems were compared: Gemini 2.5 Flash, GPT-oss 120B, Llama 3.3 70B, Qwen 3 32B, and Trinity Large Preview accessed via OpenRouter. The models were evaluated on intentionally vulnerable Android applications using weighted step-level scoring and three selected exploit guidance scenarios. In the main guidance experiment, Gemini achieved the highest combined Fully Discovered and Partially Discovered (FD + PD) detection rate, 79.1% in the representative run, while repeated runs for selected models showed limited aggregate variability. The results also indicate that more detailed prompts improve the quality of the generated guidance. The findings suggest that LLMs can serve as interactive guides for DAST of Android applications, although they should be treated as supporting tools rather than standalone security-testing systems.
Łabęda et al. studied this question.
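
To make the two metrics named in the abstract concrete, the following is a minimal sketch of how a combined FD + PD detection rate and a weighted step-level score could be computed. It is an illustration under stated assumptions, not the study's scoring procedure: the "ND" (not discovered) label, the partial-credit values, the per-step weights, and the function names are all hypothetical, and the sample labels are toy data rather than results from the experiments.

```python
from collections import Counter

# FD and PD are the outcome labels named in the abstract.
# "ND" (not discovered) is an assumed third label for illustration.
FD, PD, ND = "FD", "PD", "ND"

# Assumed credit per outcome; the study's actual values are not given here.
DEFAULT_CREDIT = {FD: 1.0, PD: 0.5, ND: 0.0}


def detection_rate(labels):
    """Return the percentage of items labelled FD or PD."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    return 100.0 * (counts[FD] + counts[PD]) / len(labels)


def weighted_step_score(labels, weights, credit=None):
    """Return a weight-normalised score in [0, 1] over labelled steps.

    Each step contributes weight * credit(label); the sum is divided
    by the total weight so that a run of all-FD steps scores 1.0.
    """
    credit = DEFAULT_CREDIT if credit is None else credit
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    earned = sum(w * credit[label] for label, w in zip(labels, weights))
    return earned / total


if __name__ == "__main__":
    # Toy data only: four steps with hypothetical weights.
    toy_labels = [FD, PD, ND, FD]
    toy_weights = [2, 1, 1, 1]
    print(f"FD + PD rate: {detection_rate(toy_labels):.1f}%")
    print(f"Weighted score: {weighted_step_score(toy_labels, toy_weights):.2f}")
```

Separating the two functions reflects that the detection rate treats FD and PD as equal hits, whereas a weighted score can distinguish full from partial discovery and give more influence to steps judged more important.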