October 20, 2025 · Open Access

Diffusion vs. Autoregressive Language Models: A Text Embedding Perspective

Key Points

The diffusion language embedding model outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval.
Bidirectional attention is crucial for encoding global context in long and complex text.
The unidirectional attention used during autoregressive pre-training misaligns with the bidirectional nature of text embedding tasks.
The diffusion language embedding model achieves competitive performance on traditional text embedding benchmarks.

Abstract

Large language model (LLM)-based embedding models, benefiting from large-scale pre-training and post-training, have begun to surpass BERT- and T5-based models on general-purpose text embedding tasks such as document retrieval. However, a fundamental limitation of LLM embeddings lies in the unidirectional attention used during autoregressive pre-training, which misaligns with the bidirectional nature of text embedding tasks. To this end, we propose adopting diffusion language models for text embeddings, motivated by their inherent bidirectional architecture and recent success in matching or surpassing LLMs, especially on reasoning tasks. We present the first systematic study of the diffusion language embedding model, which outperforms the LLM-based embedding model by 20% on long-document retrieval, 8% on reasoning-intensive retrieval, and 2% on instruction-following retrieval, and achieves competitive performance on traditional text embedding benchmarks. Our analysis verifies that bidirectional attention is crucial for encoding global context in long and complex text.
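
The contrast between unidirectional and bidirectional attention can be illustrated with a toy example. The sketch below is not the paper's model: it is a single attention layer with no learned weights, and the function names (`attention_mask`, `attend`, `mean_pool`) and the mean-pooling step are assumptions made only for illustration. It shows that under a causal mask an early token's state cannot reflect later tokens, whereas under a bidirectional mask it can.

```python
import math

def attention_mask(n, causal):
    """Return an n x n 0/1 mask; mask[i][j] == 1 means token i may attend to token j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(n)] for i in range(n)]

def attend(vectors, mask):
    """Toy dot-product self-attention without learned projections (illustrative only)."""
    n, dim = len(vectors), len(vectors[0])
    out = []
    for i in range(n):
        # Masked positions get a score of -inf so their softmax weight is exactly 0.
        scores = [
            sum(a * b for a, b in zip(vectors[i], vectors[j])) if mask[i][j] else -math.inf
            for j in range(n)
        ]
        top = max(scores)  # finite, because a token can always attend to itself
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(weights[j] / total * vectors[j][d] for j in range(n)) for d in range(dim)])
    return out

def mean_pool(states):
    """Average the token states into one embedding (a hypothetical pooling choice)."""
    n, dim = len(states), len(states[0])
    return [sum(s[d] for s in states) / n for d in range(dim)]

# Two toy "documents" that differ only in their last token.
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]

causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)

causal_a, causal_b = attend(doc_a, causal), attend(doc_b, causal)
bidir_a, bidir_b = attend(doc_a, bidirectional), attend(doc_b, bidirectional)

# Under the causal mask, the first token never sees the changed last token.
print("causal first-token states equal:", causal_a[0] == causal_b[0])
# Under the bidirectional mask, every token state reflects the whole sequence.
print("bidirectional first-token states equal:", bidir_a[0] == bidir_b[0])
print("pooled bidirectional embedding for doc_a:", mean_pool(bidir_a))
```

In this toy setting only the last token of a causally masked sequence has seen the full input, which is one way to picture why the abstract describes unidirectional attention as misaligned with embedding tasks that need global context.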
