What question did this study set out to answer?

The aim is to develop an efficient model for long-context language modeling that overcomes the limitations of traditional transformers.

May 9, 2026 | Open Access

TreeFormer: A Segment-Tree Transformer with Causal Merging for Long-Context Language Modeling

Key Points

TreeFormer is a decoder-only architecture that handles long sequences through hierarchical segment merging.
Shared lower Transformer layers process fixed-size segments independently and in parallel.
A CausalMerger module recursively merges adjacent hidden-state segments into a single merged representation, which the remaining upper layers then process.
TreeFormer is evaluated against a vanilla Transformer baseline and long-context baselines on short-context and long-context benchmarks, covering short-context parity, long-range language modeling quality, and retrieval performance.
Efficiency is measured as throughput and memory trade-offs against the same baselines.
Standard quadratic attention is preserved within each segment, while inter-segment processing scales approximately linearly with the number of segments, enabling theoretically unbounded context length under a fixed segment size.

Abstract

Autoregressive Transformers remain constrained by fixed context windows and the quadratic cost of full self-attention, which limits their ability to model very long sequences efficiently. We introduce TreeFormer, a decoder-only architecture for long-context language modeling based on hierarchical segment merging. TreeFormer splits an input sequence into fixed-size segments, applies shared lower Transformer layers independently to each segment in parallel, and then recursively merges adjacent hidden state segments through a new CausalMerger module. The resulting merged representation is processed by the remaining upper layers to produce the final language modeling outputs. To support training of the hierarchical compression module, we also introduce an optional SegmentExpander used during a dedicated reconstruction-based pretraining stage. This design preserves standard quadratic attention within each segment while making inter-segment processing scale approximately linearly with the number of segments, enabling theoretically unbounded context length under a fixed segment size. We evaluate TreeFormer against a vanilla Transformer baseline and long-context baselines on both short-context and long-context benchmarks. Our experiments are designed to measure short-context parity, long-range language modeling quality, retrieval performance, and efficiency trade-offs in throughput and memory, highlighting the strengths and limitations of hierarchical causal compression for long-context autoregressive modeling.
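The split-then-merge flow described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `split_into_segments`, `placeholder_merge`, and `tree_merge` are hypothetical names, and the merge step is a simple average-pooling stand-in for CausalMerger, whose actual design is not specified here. The sketch also assumes that the sequence length is a multiple of the segment length and that merging two segments yields one segment of the same size.

```python
import numpy as np


def split_into_segments(hidden, seg_len):
    """Split a (seq_len, d_model) array into a list of (seg_len, d_model) segments."""
    seq_len = hidden.shape[0]
    if seq_len % seg_len != 0:
        raise ValueError("this sketch assumes seq_len is a multiple of seg_len")
    return [hidden[i:i + seg_len] for i in range(0, seq_len, seg_len)]


def placeholder_merge(left, right):
    """Stand-in for CausalMerger (assumption): concatenate two segments along
    the sequence axis, then average each pair of adjacent positions so the
    result has the size of a single segment."""
    both = np.concatenate([left, right], axis=0)            # (2 * seg_len, d_model)
    return both.reshape(left.shape[0], 2, -1).mean(axis=1)  # (seg_len, d_model)


def tree_merge(segments, merge=placeholder_merge):
    """Merge adjacent segments level by level until one remains.

    Returns the final merged segment and the number of merge calls made.
    """
    n_merges = 0
    level = list(segments)
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge(level[i], level[i + 1]))
            n_merges += 1
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # an unpaired last segment is carried up unmerged
        level = next_level
    return level[0], n_merges


rng = np.random.default_rng(0)
hidden = rng.standard_normal((8 * 16, 4))  # 8 segments of 16 positions, d_model = 4
segments = split_into_segments(hidden, seg_len=16)
merged, n_merges = tree_merge(segments)
print(merged.shape, n_merges)  # (16, 4) 7
```

With S segments the tree performs S - 1 merges, which is one way to see why inter-segment work grows roughly linearly in the number of segments. Attention inside each segment still costs on the order of the segment length squared, so for a fixed segment length the total within-segment cost also grows linearly with S.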
