Graphical User Interface (GUI) testing is an important task in mobile application development, but it remains time-consuming when done manually. With the rise of Large Language Models (LLMs), there is growing interest in their potential to automate software development tasks, including GUI test generation. This study investigates the ability of LLMs to generate GUI test intentions and scripts for Android applications using multimodal inputs, such as screenshots and structured UI data. We present an approach that combines visual and textual input from eight open-source Android apps and evaluate the performance of four LLMs. The results show significant variation in the models’ ability to generate GUI tests. Claude 3 Sonnet produced the most detailed and complete test sequences. GPT-4o generated simpler test scripts with fewer test intentions and user interactions, focusing on more basic user flows. Gemini 2.5 and Gemma 3 produced moderate results that were similar to each other. These findings indicate that LLMs can aid GUI test automation, but their effectiveness depends heavily on the model used.
Fagundes et al. studied this question.