We present a systematic empirical study examining how preference optimization methods (RLHF, DPO) affect attention head specialization across eight vendor families and more than 25 large language model variants. Using a standardized evaluation protocol (bfloat16 precision, three-seed cross-validation, and SHA-256–verified prompts), we quantify attention head diversity via the Specialization Index (SI) and compare base and instruction-tuned model pairs.

Main finding: Robustness to alignment-induced specialization loss is strongly associated with training methodology, following a consistent hierarchy: Training Methodology > Sliding Window Attention > Architecture > Scale.

Key results:

SI reduction pattern: RLHF and DPO reduce SI in most model families lacking architectural protection (LLaMA-3.1: −56.3%; LLaMA-2: −7.95%), whereas models equipped with Sliding Window Attention maintain or increase specialization (Mistral: +4.2%).

Architecture-dependent sensitivity: At matched scale, Grouped Query Attention exhibits approximately 5,800× higher sensitivity to random attention noise than Multi-Head Attention (ratio-of-means across three seeds; permutation test, p < 0.05).

Training-based robustness: Synthetic training (Phi family) yields scale-invariant specialization (SI ≈ 0.33 across a 10.8× parameter range), and Qwen2 shows no observed recursive degradation within the tested 50-generation window.

This release includes 19 documented Jupyter notebooks that support the full experimental pipeline, 27 result JSON files, and command-line tools that enable end-to-end reproducibility.

The paper text is released under CC-BY-4.0; accompanying code and tooling are released under the MIT License.
Davide D'Elia (Tue,) studied this question.
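
The abstract describes the evaluation prompts as SHA-256 verified but does not say how the check is performed. The following is a minimal sketch of one way such a check could work, not the release's actual tooling: the function names `prompt_digest` and `verify_prompts`, the prompt IDs, the prompt texts, and the manifest layout are all hypothetical. It relies only on Python's standard `hashlib` module.

```python
import hashlib


def prompt_digest(prompt: str) -> str:
    """Return the hex SHA-256 digest of a prompt's UTF-8 bytes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def verify_prompts(prompts: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """Return IDs of prompts whose digest is absent from or differs from the manifest."""
    return [
        pid for pid, text in prompts.items()
        if manifest.get(pid) != prompt_digest(text)
    ]


# Hypothetical prompt set; in practice the manifest would be stored with the results.
prompts = {"p001": "The capital of France is", "p002": "2 + 2 ="}
manifest = {pid: prompt_digest(text) for pid, text in prompts.items()}

# An unmodified prompt set produces no mismatches.
mismatched = verify_prompts(prompts, manifest)

# A single trailing space changes the digest, so the edited prompt is flagged.
tampered = verify_prompts({**prompts, "p002": "2 + 2 = "}, manifest)
```

Hashing the exact bytes fed to the model, rather than a normalized form, means that even whitespace-only edits are detected, which seems to be the point of verifying prompts across seeds and model families.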