What question did this study set out to answer?

This study examines how language models encode accessibility concepts internally and tests which structural conditions are necessary for behavioral capability, that is, for correctly defining those concepts.

May 26, 2026 | Open Access

Accessibility Concept Emergence in the Pythia Suite: Thresholds, Binding, and the Declarative-Evaluative Gap

Key Points

The study uses the Pythia suite (160M–12B) and GPT-2 (small–XL) with TransformerLens to analyze how model knowledge is encoded internally.
Web Content Accessibility Guidelines (WCAG) serve as the test domain because accessibility has a small, well-defined vocabulary with unambiguous answers and is rare in web-scale training data.
The declarative-evaluative gap is tested across 15 prompts spanning three elicitation strategies, and entropy analysis shows that the gap has internal structure.
All models show compound binding in early layers; what differentiates them is whether that binding persists to late layers.
Screen reader, skip link, and alt text emerge behaviorally at ~2.8B parameters; WCAG first appears at 6.9B parameters.
A declarative-evaluative gap persists even at the largest scale: models that correctly define accessibility concepts cannot reliably identify violations in code.
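
The early-versus-late binding comparison in the key points can be illustrated with a minimal sketch. It assumes, purely for illustration, that binding is measured as the attention weight from a compound's second token (for example "reader") back to its first token ("screen"), taken from the strongest head in each layer; the study's actual metric may differ. The nested-list attention format, the names `binding_by_layer` and `persists_late`, the 0.3 threshold, and the toy values are all hypothetical. In practice such patterns could be read from a TransformerLens activation cache.

```python
# Hypothetical sketch: not the study's metric, only one plausible way to
# quantify "binding" between the two tokens of a compound term.

def binding_by_layer(attn, first_pos, second_pos):
    """Per layer, return the strongest head's attention weight from the
    compound's second token back to its first token.

    attn is assumed to be indexed as attn[layer][head][query_pos][key_pos].
    """
    return [
        max(head[second_pos][first_pos] for head in layer)
        for layer in attn
    ]

def persists_late(scores, threshold, late_layers):
    """True if any of the final `late_layers` layers reaches the threshold.

    Both parameters are illustrative assumptions.
    """
    return any(s >= threshold for s in scores[-late_layers:])

def toy_head(a):
    """Causal attention over two tokens; `a` is the weight token 1 puts on token 0."""
    return [[1.0, 0.0], [a, 1.0 - a]]

# Toy model A: strong binding in the first layer that fades with depth.
early_only = [
    [toy_head(0.8), toy_head(0.1)],
    [toy_head(0.2), toy_head(0.1)],
    [toy_head(0.05), toy_head(0.1)],
]

# Toy model B: binding that is still present in the last layer.
sustained = [
    [toy_head(0.8), toy_head(0.1)],
    [toy_head(0.5), toy_head(0.2)],
    [toy_head(0.6), toy_head(0.1)],
]

early_scores = binding_by_layer(early_only, first_pos=0, second_pos=1)
sustained_scores = binding_by_layer(sustained, first_pos=0, second_pos=1)

print(early_scores, persists_late(early_scores, threshold=0.3, late_layers=1))
print(sustained_scores, persists_late(sustained_scores, threshold=0.3, late_layers=1))
```

Both toy models bind in the first layer, so an early-layer score alone would not separate them; only the late-layer check does, which mirrors the distinction the key points describe.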

Abstract

Sustained deep-network binding of accessibility compounds appears to be a necessary structural condition for behavioral capability — present in every model that correctly defines core concepts, absent in every model that fails. We use Web Content Accessibility Guidelines (WCAG) as our test domain because accessibility represents a specialized, low-frequency domain in web-scale training data — a small, well-defined vocabulary with unambiguous answers and direct relevance to real-world tooling decisions, making it an unusually clean lens for studying emergence: concepts are concrete enough to evaluate and rare enough to show scale sensitivity. Using the Pythia suite (160M–12B) and GPT-2 (small–XL) with TransformerLens, we investigate not just what models know but how that knowledge is encoded internally. All models show compound binding in early layers; the differentiating factor is whether that binding persists to late network layers. Screen reader, skip link, and alt text emerge behaviorally at ~2.8B; WCAG first appears at 6.9B; Accessible Rich Internet Applications (ARIA) exhibits fluent wrongness at every scale tested — producing confident, plausible expansions that are consistently incorrect. Models prefer correct definitions before they can produce them, and a declarative-evaluative gap persists even at maximum scale: models that correctly define accessibility concepts cannot reliably identify violations in code. The gap is robust across 15 prompts spanning three elicitation strategies and is not an artifact of prompt design. Entropy analysis reveals that the gap has internal structure — the model enters distinct failure states depending on how it is asked, from high-entropy stalling to low-entropy confident parroting. Extending the binding analysis to Pythia 12B introduces a late-layer resurgence pattern: a cluster of binding heads re-engaging near the output layers that scales monotonically with model size.
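
The contrast between high-entropy stalling and low-entropy confident output can be made concrete with a small sketch of Shannon entropy over a next-token distribution. This is an illustration under stated assumptions, not the study's procedure: the function names, the normalization by vocabulary size, and the 0.2 and 0.8 cutoffs are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def normalized_entropy(logits):
    """Entropy divided by its maximum, log2(vocabulary size), giving a value in [0, 1].

    Requires at least two logits.
    """
    probs = softmax(logits)
    return entropy_bits(probs) / math.log2(len(probs))

def describe_state(logits, low=0.2, high=0.8):
    """Label a next-token distribution; the cutoffs are illustrative assumptions."""
    h = normalized_entropy(logits)
    if h >= high:
        return "high-entropy"
    if h <= low:
        return "low-entropy"
    return "intermediate"

# Toy 4-token vocabulary: one flat distribution, one sharply peaked distribution.
flat_logits = [0.0, 0.0, 0.0, 0.0]
peaked_logits = [10.0, 0.0, 0.0, 0.0]

print(describe_state(flat_logits), describe_state(peaked_logits))
```

Low entropy only shows that the model is confident. Deciding whether a confident answer is correct or merely parroted still requires comparing the output against a reference.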

Cite This Study

Trisha Salas studied this question.

synapsesocial.com/papers/6a153bdfb5d9c58d83e8d528
https://doi.org/10.5281/zenodo.20360788