What type of study is this?

This is a validation study.

What question did this study set out to answer?

The aim is to develop a framework that merges language models while maintaining alignment quality.

December 22, 2025 · Open Access

AlignMerge - Alignment-Preserving Large Language Model Merging via Fisher-Guided Geometric Constraints

Key Points

Aims to develop a framework that merges language models while maintaining alignment quality.
Introduces AlignMerge, a geometry-aware merging framework.
Estimates an alignment subspace in a local Fisher chart around an instruction-tuned base.
Optimizes the L_AlignMerge objective, which combines geometry, alignment, and budget terms.
Improves alignment metrics, including AQI and toxicity.
Matches or exceeds the best expert model on instruction-following and reasoning.
Exhibits smaller alignment-subspace drift and fewer budget violations than existing merging methods.

Abstract

Merging large language models (LLMs) is a practical way to compose capabilities from multiple fine-tuned checkpoints without retraining. Yet standard schemes (linear weight soups, task vectors, and Fisher-weighted averaging) can preserve loss while quietly destroying alignment. We argue that merging is not a numerical trick but a geometry-constrained operation around an already-aligned anchor: fusion must be steered to respect safety geometry, not validated post hoc. We introduce AlignMerge, a geometry-aware merging framework that makes alignment an explicit invariant. In a local Fisher chart around an instruction-tuned base, we estimate an alignment subspace with projector $P_A$ and optimize $L_{\text{AlignMerge}} = L_{\text{geo}} + \lambda_{\text{align}} L_{\text{align}} + \lambda_{\text{bud}} L_{\text{bud}}$, where $L_{\text{geo}}$ keeps the merge close to its experts in Fisher-Rao geometry, $L_{\text{align}}$ penalizes motion along alignment-sensitive directions, and $L_{\text{bud}}$ enforces a soft alignment budget. As the alignment functional we use the decoding-invariant Alignment Quality Index (AQI), a latent-space criterion that captures how cleanly aligned and misaligned behaviors separate in representation space. Across five model families (LLaMA-3 8B, Mistral 7B, Qwen 2, Phi-3.5, Gemma 2), merging safety anchors with task experts, AlignMerge improves alignment metrics (AQI, toxicity, LLM-judge alignment) while matching or exceeding the best expert on instruction-following, reasoning, and helpfulness. It also exhibits smaller alignment-subspace drift and fewer budget violations than Fisher soups, TIES, SafeMerge, and MergeAlign. These results make alignment-preserving merging a first-class design goal and suggest a path to geometry-aware composition of future foundation models.
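To make the three-term objective concrete, here is a minimal toy sketch on flat parameter vectors. It is not the paper's implementation, and every modelling choice in it is an assumption made for illustration: the Fisher information is approximated by a diagonal (`fisher_diag`), the Fisher-Rao distance is replaced by its local quadratic proxy, the alignment subspace is given as an orthonormal basis `U` (so the projector is `U @ U.T`), and the budget term is a squared hinge on projected displacement, not on AQI. The function name `alignmerge_loss` and the `budget` parameter are hypothetical.

```python
import numpy as np

def alignmerge_loss(theta, base, experts, fisher_diag, U,
                    lam_align=1.0, lam_bud=1.0, budget=0.1):
    """Toy composite merging objective (illustrative assumptions throughout).

    theta       : candidate merged parameters, shape (d,)
    base        : aligned anchor parameters, shape (d,)
    experts     : list of expert parameter vectors, each shape (d,)
    fisher_diag : assumed diagonal Fisher approximation at the base, shape (d,)
    U           : assumed orthonormal basis of the alignment subspace, shape (d, r)
    """
    # Geometry term: mean Fisher-weighted squared distance to the experts,
    # a local quadratic stand-in for a Fisher-Rao distance.
    l_geo = float(np.mean([np.sum(fisher_diag * (theta - e) ** 2) for e in experts]))

    # Alignment term: squared norm of the displacement from the base,
    # projected onto the alignment subspace (P_A = U U^T).
    delta = theta - base
    proj = U @ (U.T @ delta)
    l_align = float(proj @ proj)

    # Budget term: squared hinge that activates only once the projected
    # displacement exceeds the allowed budget (a soft constraint).
    l_bud = max(0.0, l_align - budget) ** 2

    total = l_geo + lam_align * l_align + lam_bud * l_bud
    return {"total": total, "geo": l_geo, "align": l_align, "bud": l_bud}


if __name__ == "__main__":
    base = np.zeros(3)
    experts = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
    fisher_diag = np.ones(3)
    U = np.array([[1.0], [0.0], [0.0]])  # first axis is "alignment-sensitive"

    # The plain average moves along the alignment axis and is penalized for it.
    print(alignmerge_loss(np.array([0.5, 0.5, 0.0]), base, experts, fisher_diag, U))
    # A merge that stays off the alignment axis pays only the geometry term.
    print(alignmerge_loss(np.array([0.0, 0.5, 0.0]), base, experts, fisher_diag, U))
```

In this toy setting the plain average sits closer to the experts but incurs alignment and budget penalties, which is the trade-off the weights $\lambda_{\text{align}}$ and $\lambda_{\text{bud}}$ are meant to control.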
