User interactions with mobile applications (apps) are accompanied by continuous visual changes in the Graphical User Interface (GUI) that guide task completion and provide feedback. These changes help users complete intended tasks or assess the appropriateness of their actions, and they are typically conveyed through visual cues such as appearance and color. While such visual changes are effective for sighted users, they are inaccessible to blind users, creating substantial barriers to GUI interaction. To address these challenges, we propose VisualDroid, a method based on a multi-modal large language model (LLM) for testing and classifying GUI visual changes using a tailored three-hop reasoning prompting framework. VisualDroid achieved an F1-score of 94.7% on 34 apps from 17 domains, surpassing all baseline methods. When evaluated on five open-source apps from F-Droid, our method enabled developers to resolve three identified issues, with two still under review. In terms of efficiency and cost, our method incurs minimal resource consumption.
Zhang et al. (Tue,) studied this question.