Reusing separate, pre-filled Key-Value (KV) caches for multiple contexts has become common practice for handling multi-context scenarios with Large Language Models. However, because each cache is computed independently, tokens in one context never attend to tokens in another, so cross-attention between contexts is lost. To address this, we propose CatLLM, the first method that concatenates multiple contexts across requests offline to compensate for this deficiency. Specifically, during offline processing, CatLLM rewrites the weighted inner products of the Q and K vectors of tokens in an un-concatenated context as an equivalent weighted formulation over the concatenated Q and K inner products. This yields a weight $w_i^{A+B}$ that corresponds to the difference between the output vectors, which is then used to identify contexts with severe cross-attention deficiencies and concatenate them into a single context for KV cache computation. Experimental results show that, compared to the baseline of separate caching (i.e., no concatenation), fully concatenating all contexts improves the F1 score by 6%. Meanwhile, the proposed method reduces the number of contexts requiring caching from 10 to 7 while achieving a 3% F1 score improvement, thereby maximizing the performance improvement while minimizing the degree of context compression.
Cao et al. (Wed,) studied this question.
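The relationship between separate and concatenated attention described in the abstract can be illustrated with a minimal sketch. It is not the CatLLM formulation itself. It relies only on the standard softmax identity that attention over concatenated keys equals a weighted mix of the attention outputs over each key set, and it assumes a single query, a single head, and no causal mask or positional encoding. The names `attention`, `cross_attention_weight`, and `w_a` are hypothetical and chosen for illustration.

```python
import math

def attention(q, keys, values):
    """Single-query softmax attention.

    Returns the output vector and the log of the softmax normalizer
    (log-sum-exp of the scaled scores)."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[j] for e, v in zip(exps, values)) / z for j in range(dim)]
    return out, m + math.log(z)

def cross_attention_weight(q, keys_a, keys_b, values_a, values_b):
    """For a query token from context B, compare attention over B alone
    (separate caching) with attention over A+B (concatenation).

    Illustrative assumption: the share of attention mass that falls on A
    is used as a proxy for how much cross-attention separate caching loses."""
    out_a, lse_a = attention(q, keys_a, values_a)
    out_b, lse_b = attention(q, keys_b, values_b)
    # Share of softmax mass on A's keys when A and B are concatenated.
    w_a = 1.0 / (1.0 + math.exp(lse_b - lse_a))
    # Concatenated output is a convex mix of the two separate outputs.
    out_cat = [w_a * a + (1.0 - w_a) * b for a, b in zip(out_a, out_b)]
    return w_a, out_b, out_cat

# Toy example: context A has two tokens, context B has one.
q = [0.5, -1.0]
keys_a, values_a = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
keys_b, values_b = [[0.5, 0.5]], [[5.0, 6.0]]
w_a, out_b, out_cat = cross_attention_weight(q, keys_a, keys_b, values_a, values_b)
```

Under these assumptions, the output difference for the query is `out_cat - out_b = w_a * (out_a - out_b)`, so a small `w_a` suggests little is lost by caching the two contexts separately, while a large `w_a` flags a pair that may be worth concatenating.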