What question did this study set out to answer?

This research aims to evaluate the accessibility of web code generated by large language models (LLMs) compared to human-written code.

June 14, 2026

An Empirical Study on Evaluating Accessible Code Generation Capabilities of LLMs

Key Points

Compared accessibility of code generated by GPT-4o, Qwen2.5-Coder-32B-Instruct-AWQ, and Gemini-3-Flash with human-written code.
Assessed advanced prompting strategies: Zero-Shot, Few-Shot, Self-Criticism.
Introduced FeedA11y, a feedback-driven approach for enhancing accessibility in code generation.
LLMs generated more accessible code than human developers for basic features such as color contrast and alternative text.
Complex issues such as correct use of ARIA attributes remained challenging for LLM-generated code.
Advanced prompting strategies yielded only limited accessibility gains.
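The "basic features" above lend themselves to mechanical checks. The Python sketch below is illustrative only and is not the study's tooling: it assumes the WCAG 2.x definitions of relative luminance and contrast ratio (level AA asks for at least 4.5:1 for normal text) and treats an <img> with no alt attribute as a violation, while allowing alt="" for decorative images. All function and class names are hypothetical.

```python
from html.parser import HTMLParser


def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color, between 0 and 1."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """WCAG contrast ratio (L_lighter + 0.05) / (L_darker + 0.05), from 1 to 21."""
    lighter, darker = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (lighter + 0.05) / (darker + 0.05)


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no alt attribute at all.

    An empty alt="" is accepted, since it marks an image as decorative.
    """

    def __init__(self) -> None:
        super().__init__()
        self.missing: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and "alt" not in attr_map:
            self.missing.append(attr_map.get("src") or "<no src>")


def images_missing_alt(html: str) -> list[str]:
    """Return the src values of images lacking an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```

For example, gray #767676 on white has a contrast ratio of about 4.54 and passes AA, while #777777 on white falls just below 4.5; images_missing_alt('<img src="a.png"><img src="b.png" alt="">') flags only "a.png".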

Abstract

Web accessibility is essential for inclusive digital experiences, yet the accessibility of LLM-generated code remains underexplored. This paper presents an empirical study comparing the accessibility of web code generated by GPT-4o, Qwen2.5-Coder-32B-Instruct-AWQ, and Gemini-3-Flash against human-written code. Results show that LLMs often produce more accessible code than humans, especially for basic features such as color contrast and alternative text, but struggle with complex issues such as ARIA attributes. We also assess advanced prompting strategies (Zero-Shot, Few-Shot, Self-Criticism) and find that they yield some gains, but only limited ones. To address these gaps, we introduce FeedA11y, a feedback-driven, ReAct-based approach that demonstrates the potential of incorporating accessibility evaluation results into the code generation process. Our work highlights the promise of LLMs for accessible code generation and underscores the need for feedback-based techniques to address persistent challenges. The source code and datasets used in our experiments are available on the companion website (footnote 15).
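The central idea behind FeedA11y, feeding accessibility evaluation results back into generation, can be pictured as a generate, evaluate, revise loop. The Python sketch below is a minimal illustration under stated assumptions, not the authors' implementation: generate stands in for an LLM call and evaluate for an accessibility checker (both hypothetical callables), and the loop omits the interleaved reasoning and tool-selection steps of a full ReAct agent.

```python
from typing import Callable

# Hypothetical signatures, not taken from the FeedA11y paper:
# generate(prompt) -> code, evaluate(code) -> list of violation messages.
Generate = Callable[[str], str]
Evaluate = Callable[[str], list[str]]


def feedback_loop(
    task: str,
    generate: Generate,
    evaluate: Evaluate,
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Generate code, check it, and feed reported violations back into the prompt.

    Stops early once the checker reports no violations, or after max_rounds
    revisions. Returns the final code and any violations still outstanding.
    """
    code = generate(task)
    violations = evaluate(code)
    for _ in range(max_rounds):
        if not violations:
            break
        feedback = "\n".join(f"- {v}" for v in violations)
        prompt = (
            f"{task}\n\nYour previous code:\n{code}\n\n"
            f"An accessibility checker reported:\n{feedback}\n"
            "Revise the code to fix these issues."
        )
        code = generate(prompt)
        violations = evaluate(code)
    return code, violations
```

Capping the number of rounds keeps the cost bounded when the checker keeps reporting issues the model cannot fix, such as subtle ARIA misuse; returning the remaining violations lets the caller decide what to do next.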


Cite This Study

Suh et al. studied this question.

synapsesocial.com/papers/6a2e4632b1cc60ccdea8afa2 https://doi.org/10.1145/3820782
