What question did this study set out to answer?

This review aims to examine how large language models enhance GUI testing by bridging the semantic gap between human and machine understanding.

May 8, 2026 | Open Access

A process-centric review of large language models in graphical user interface testing: architectures, lifecycle impact, and challenges

Key Points

Reviews 55 studies published between January 2023 and July 2025 on LLM architectures and their impact on GUI testing.
Analyzes the integration of LLM agents across the testing lifecycle: design, scripting, execution, oracle verification, and maintenance.
Identifies architectural trends and lifecycle influences in GUI testing processes.
Effective LLM agents use a "spatial-semantic" perception model that combines visual screenshots with DOM structures.
The paradigm is shifting from specification-based scripting to autonomous, intent-driven exploration and self-healing maintenance.
Current benchmarks measure task completion well, but a gap remains between academic prototypes and industrial needs for reliability, cost, and latency.
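
To make the "spatial-semantic" idea concrete, here is a minimal sketch under stated assumptions, not a method taken from the reviewed studies. It pairs each DOM node with the screenshot region that overlaps it most, then picks the element whose combined text best matches a natural-language intent. The names `Box`, `DomNode`, `VisualRegion`, `fuse`, and `ground` are hypothetical, and the word-overlap scoring stands in for whatever an LLM agent would actually do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float

    def iou(self, other: "Box") -> float:
        # Intersection-over-union of two axis-aligned rectangles.
        ix = max(0.0, min(self.x + self.w, other.x + other.w) - max(self.x, other.x))
        iy = max(0.0, min(self.y + self.h, other.y + other.h) - max(self.y, other.y))
        inter = ix * iy
        union = self.w * self.h + other.w * other.h - inter
        return inter / union if union > 0 else 0.0


@dataclass(frozen=True)
class DomNode:
    selector: str  # structural (semantic) handle from the DOM
    role: str
    text: str
    box: Box       # where the node is rendered on screen


@dataclass(frozen=True)
class VisualRegion:
    label: str     # assumed output of some vision model, e.g. a caption
    box: Box


def fuse(dom_nodes, regions, min_iou=0.5):
    """Attach to each DOM node the label of its best-overlapping visual region."""
    fused = []
    for node in dom_nodes:
        best = max(regions, key=lambda r: node.box.iou(r.box), default=None)
        matched = best is not None and node.box.iou(best.box) >= min_iou
        fused.append({
            "selector": node.selector,
            "role": node.role,
            "text": node.text,
            "visual_label": best.label if matched else "",
            "center": (node.box.x + node.box.w / 2, node.box.y + node.box.h / 2),
        })
    return fused


def ground(intent, fused):
    """Return the fused element sharing the most words with the intent, or None."""
    words = set(intent.lower().split())

    def score(el):
        tokens = set((el["text"] + " " + el["visual_label"]).lower().split())
        return len(words & tokens)

    best = max(fused, key=score, default=None)
    return best if best is not None and score(best) > 0 else None
```

In this sketch, an icon-only button with no DOM text can still be grounded, because the visual label supplies the missing semantics while the DOM supplies a stable selector and click coordinates.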

Abstract

Graphical User Interface (GUI) testing has historically struggled with the “semantic gap” between human understanding and machine execution. Large Language Models (LLMs) are now bridging this gap by enabling a transition from automating repetitive actions to automating cognitive processes. This article presents a process-centric review of 55 seminal studies published between January 2023 and July 2025 to systematize this rapid evolution. Unlike existing surveys that focus on isolated architectural elements, we analyze the integration of LLM agents across the entire testing lifecycle, from test design and scripting to execution, oracle verification, and maintenance. Our analysis reveals three key findings: (1) Architecture: Effective agents have converged on a “spatial-semantic” perception model, combining visual screenshots with Document Object Model (DOM) structures to ground high-level intent into precise actions. (2) Lifecycle Impact: The paradigm is shifting from rigid, specification-based script generation to autonomous, intent-driven exploration and self-healing maintenance via abstraction-concretization mechanisms. (3) Evaluation: While current benchmarks effectively measure task completion, a disconnect remains between academic prototypes and industrial requirements regarding reliability, cost, and latency. The article concludes by identifying critical gaps in business process testing and outlining a research roadmap to advance LLM-based testing from experimental prototypes to robust, enterprise-grade quality assurance solutions.
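
The abstract does not spell out how the abstraction-concretization mechanism works, so the following is only an illustrative sketch of one plausible reading: a test step is stored as an abstract intent (action, role, label) and resolved to a concrete selector against the current UI at run time. If the cached selector no longer matches, the step is re-concretized instead of failing. All names here (`AbstractStep`, `Element`, `concretize`, `run_step`) are hypothetical, and exact label matching stands in for the LLM-based matching a real agent would use.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AbstractStep:
    action: str  # e.g. "click"
    role: str    # e.g. "button"
    label: str   # human-readable intent target, e.g. "Log in"


@dataclass(frozen=True)
class Element:
    selector: str
    role: str
    label: str


def _matches(step: AbstractStep, el: Element) -> bool:
    # Case-insensitive comparison; an assumption made for this sketch.
    return el.role == step.role and el.label.strip().lower() == step.label.strip().lower()


def concretize(step: AbstractStep, ui: List[Element]) -> Optional[str]:
    """Resolve an abstract step to a concrete selector in the current UI."""
    for el in ui:
        if _matches(step, el):
            return el.selector
    return None


def run_step(step: AbstractStep, cached_selector: str, ui: List[Element]) -> Tuple[str, bool]:
    """Return (selector_used, healed). Heal only when the cached selector is stale."""
    for el in ui:
        if el.selector == cached_selector and _matches(step, el):
            return cached_selector, False
    healed = concretize(step, ui)
    if healed is None:
        raise LookupError(f"no element matches {step.role!r} labelled {step.label!r}")
    return healed, True
```

The design choice in this sketch is that the abstract step, not the selector, is the source of truth: a renamed `id` changes only the cached selector, while the test's intent stays untouched.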

Cite This Study

Trong et al. studied this question.

synapsesocial.com/papers/69fd7e90bfa21ec5bbf06c90 https://doi.org/10.7717/peerj-cs.3695
