What type of study is this?

This is a quantitative study.

September 18, 2025

Evaluating LLMs for Multimodal GUI Test Generation in Android Applications

Key Points

LLMs can generate GUI test intentions and scripts for Android applications from multimodal inputs such as screenshots and structured UI data.
Claude 3 Sonnet produced the most detailed and complete test sequences of the four models evaluated.
The study evaluates four LLMs (Claude 3 Sonnet, GPT-4o, Gemini 2.5, and Gemma 3) using inputs from eight open-source Android apps.
Variation in model effectiveness suggests further research is needed to optimize LLM applications in software testing.

Abstract

Graphical User Interface (GUI) testing is an important task in mobile application development but remains time-consuming when done manually. With the rise of Large Language Models (LLMs), there is growing interest in their potential to automate software development tasks, including GUI test generation. This study investigates the ability of LLMs to generate GUI test intentions and scripts for Android applications using multimodal inputs, such as screenshots and structured UI data. We present an approach that combines visual and textual input from eight open-source Android apps and evaluate the performance of four LLMs. The results show significant variation in the models’ ability to generate GUI tests: Claude 3 Sonnet produced the most detailed and complete test sequences, GPT-4o generated simpler test scripts with fewer test intentions and user interactions, focusing on more basic user flows, while Gemini 2.5 and Gemma 3 presented moderate and similar results. These findings indicate that while LLMs can aid GUI test automation, their effectiveness varies significantly across models.
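
To make the idea of combining visual and textual input concrete, the following is a minimal sketch of how one screenshot and one structured UI dump might be packaged into a single multimodal request. It is not the study's pipeline. The function names, the prompt wording, and the request dictionary layout are all hypothetical, and the sketch assumes the UI data is an XML hierarchy of `node` elements with attributes such as `clickable`, `text`, `resource-id`, and `bounds`, as produced by common Android UI dump tools.

```python
import base64
import json
import xml.etree.ElementTree as ET


def summarize_ui(xml_text):
    """Return the clickable widgets found in a UI hierarchy dump.

    Assumption: the dump is XML made of <node> elements carrying
    'clickable', 'class', 'text', 'resource-id' and 'bounds' attributes.
    """
    root = ET.fromstring(xml_text)
    widgets = []
    for node in root.iter("node"):
        if node.get("clickable") == "true":
            widgets.append({
                "class": node.get("class", ""),
                "text": node.get("text", ""),
                "resource_id": node.get("resource-id", ""),
                "bounds": node.get("bounds", ""),
            })
    return widgets


def build_request(screenshot_bytes, xml_text, app_name):
    """Bundle a screenshot and a widget summary into one request dict.

    The dict layout is hypothetical; a real model API would need its
    own message format.
    """
    widgets = summarize_ui(xml_text)
    instruction = (
        f"You are testing the Android app '{app_name}'. "
        "Using the screenshot and the widget list, propose GUI test "
        "intentions and the user interactions needed for each."
    )
    return {
        "text": instruction + "\n\nWidgets:\n" + json.dumps(widgets, indent=2),
        "image_base64": base64.b64encode(screenshot_bytes).decode("ascii"),
        "image_media_type": "image/png",
    }
```

Filtering the hierarchy down to clickable widgets is one possible design choice for keeping the textual input short; the source does not say how the structured UI data was preprocessed.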
