Autoregressive Transformers remain constrained by fixed context windows and the quadratic cost of full self-attention, which limits their ability to model very long sequences efficiently. We introduce TreeFormer, a decoder-only architecture for long-context language modeling based on hierarchical segment merging. TreeFormer splits an input sequence into fixed-size segments, applies shared lower Transformer layers to each segment independently and in parallel, and then recursively merges adjacent segments of hidden states through a new CausalMerger module. The remaining upper layers then process the merged representation to produce the final language-modeling outputs. To support training of the hierarchical compression module, we also introduce an optional SegmentExpander, used during a dedicated reconstruction-based pretraining stage. This design preserves standard quadratic attention within each segment while making inter-segment processing scale approximately linearly with the number of segments, so that, for a fixed segment size, context length is unbounded in principle. We evaluate TreeFormer against a vanilla Transformer baseline and long-context baselines on both short-context and long-context benchmarks. Our experiments are designed to measure short-context parity, long-range language modeling quality, retrieval performance, and efficiency trade-offs in throughput and memory, highlighting the strengths and limitations of hierarchical causal compression for long-context autoregressive modeling.
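The split-then-merge control flow described above can be illustrated with a minimal, dependency-free Python sketch. It is not the TreeFormer implementation: the names `split_into_segments`, `toy_merge`, `merge_tree`, and `attention_pair_counts` are hypothetical, each hidden state is reduced to a single float, and the CausalMerger is replaced by a position-wise mean purely as a placeholder. The sketch also assumes that merging two segments yields one segment of the same size, which the abstract does not state. It shows only two things: that pairwise merging of n segments takes n - 1 merge calls, and that restricting attention to segments reduces the number of attended position pairs.

```python
from typing import Callable, List, Tuple

Segment = List[float]  # toy stand-in: one scalar "hidden state" per position


def split_into_segments(states: List[float], segment_size: int) -> List[Segment]:
    """Split a sequence into consecutive fixed-size segments."""
    if segment_size <= 0 or len(states) % segment_size != 0:
        raise ValueError("sequence length must be a positive multiple of segment_size")
    return [states[i:i + segment_size] for i in range(0, len(states), segment_size)]


def toy_merge(left: Segment, right: Segment) -> Segment:
    """Hypothetical placeholder for CausalMerger: position-wise mean.

    Assumption: two segments are compressed into one segment of the same size.
    """
    return [(a + b) / 2 for a, b in zip(left, right)]


def merge_tree(
    segments: List[Segment],
    merge: Callable[[Segment, Segment], Segment],
) -> Tuple[Segment, int]:
    """Merge adjacent pairs level by level until one segment remains.

    Returns the root segment and the number of merge calls performed.
    """
    if not segments:
        raise ValueError("need at least one segment")
    level = list(segments)
    n_merges = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge(level[i], level[i + 1]))
            n_merges += 1
        if len(level) % 2 == 1:
            # An unpaired trailing segment is carried up to the next level.
            next_level.append(level[-1])
        level = next_level
    return level[0], n_merges


def attention_pair_counts(seq_len: int, segment_size: int) -> Tuple[int, int]:
    """Count attended position pairs: full attention vs. per-segment attention.

    Ignores causal masking and any cost of the merger itself.
    """
    n_segments = seq_len // segment_size
    return seq_len * seq_len, n_segments * segment_size * segment_size


states = [float(x) for x in range(8)]
segments = split_into_segments(states, segment_size=2)
root, n_merges = merge_tree(segments, toy_merge)
full_pairs, segmented_pairs = attention_pair_counts(len(states), 2)
```

With 8 positions and a segment size of 2, this produces 4 segments and 3 merge calls, and the per-segment pair count is 16 against 64 for full attention. In a real model the merge function would be a learned module operating on hidden-state tensors, and the lower layers would run on each segment before any merging takes place.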
François MONDÉ KOSSI studied this question.