What question did this study set out to answer?

The aim is to enhance understanding of user actions and high-level activities recorded in UI logs when no structured on-screen information (e.g., a DOM tree) is available.

February 14, 2026 | Open Access

Enriching Process-Related UI Logs Via Screenshot-Based Activity Labeling by Using Vision-Language Models

Key Points

Aims to enhance understanding of user actions and activities in UI logs that lack structured information.
Proposes a framework based on screenshot-based techniques.
Generates semantic descriptions of user actions from the UI logs alone.
Evaluated on a manually labeled dataset of screenshots from realistic desktop applications.
Results show that the method effectively generates semantic descriptions of user interactions.
These descriptions enable more precise descriptions of high-level user activities.
Improves analysts' understanding of candidate business processes for automation.

Abstract

Robotic Process Mining (RPM) uses User Interface (UI) logs as a source of information for analyzing the processes that are to be automated. UI logs record user interactions with the graphical UI of an information system during the execution of a process, and they encapsulate a large amount of data. Prior research has proposed methods that interpret UI logs by exploiting the structured information available on-screen (e.g., the DOM tree of a Web page), which makes it easier for analysts to interpret the processes behind the logs. However, in environments where such structured information is not available (e.g., virtualized environments), understanding user actions and high-level activities through the elements that users interact with remains an unsolved challenge. This limitation hinders the application of RPM techniques in these environments and requires human intervention to analyze and understand the actions recorded in the UI logs. To address this challenge, we propose a framework that leverages screenshot-based techniques to generate semantic descriptions of user actions, which in turn allow us to generate accurate descriptions of high-level activities by relying solely on the information available in the UI logs. In an organizational context, this approach enables Robotic Process Automation (RPA) analysts and process managers to analyze user interaction logs and improve their understanding of candidate business processes for automation. We evaluate our approach using a manually labeled dataset of screenshots from realistic desktop applications. Our results demonstrate that the method can effectively generate semantic descriptions of user actions, which in turn enable more precise descriptions of the high-level activities carried out by the user.
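
To make the idea of screenshot-based enrichment concrete, the following is a minimal sketch of how a UI log row might be paired with a screenshot region and a generated description. It is an illustration under stated assumptions, not the framework described in the paper: the `UIEvent` fields, the `crop_box` and `enrich_log` helpers, the 200-pixel crop size, the default screen resolution, and the `fake_describe` placeholder (which stands in for a vision-language model call) are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass(frozen=True)
class UIEvent:
    """One UI log row; field names are hypothetical, not the paper's schema."""
    timestamp: float
    event_type: str        # e.g. "click" or "keystroke"
    x: int                 # pointer position when the event was logged
    y: int
    screenshot: str        # path to the screenshot captured with the event
    description: str = ""  # semantic label, filled in by enrich_log


def crop_box(x: int, y: int, width: int, height: int, size: int = 200) -> Box:
    """Return a size-by-size region centred on (x, y), clamped to the screen."""
    half = size // 2
    left = max(0, min(x - half, width - size))
    top = max(0, min(y - half, height - size))
    return (left, top, min(left + size, width), min(top + size, height))


def enrich_log(
    events: List[UIEvent],
    describe: Callable[[UIEvent, Box], str],
    screen: Tuple[int, int] = (1920, 1080),
) -> List[UIEvent]:
    """Attach a description to each event using the supplied describer."""
    width, height = screen
    return [
        replace(e, description=describe(e, crop_box(e.x, e.y, width, height)))
        for e in events
    ]


def fake_describe(event: UIEvent, box: Box) -> str:
    """Placeholder for a vision-language model call; returns a canned string."""
    return f"{event.event_type} on element near {box}"


if __name__ == "__main__":
    log = [UIEvent(0.0, "click", 960, 540, "shot_0001.png")]
    for row in enrich_log(log, fake_describe):
        print(row.timestamp, row.description)
```

In a real setting, `describe` would send the cropped region (or the full screenshot plus coordinates) to a model and return its text; passing it as a parameter keeps the log-handling logic independent of any particular model or API.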
