A large corpus is first partitioned into computationally manageable chunks, then Noken implicit models are used to jointly learn query-key-value (Q-K-V) embeddings on each chunk. To compare a pair of embeddings, we use their ability to capture semantics on each other's training chunk, as measured by average Rényi α-entropy. After a bubble sort based on these pairwise comparisons, the resulting chunk Q-K-V token embedding is used across the entire corpus for transformer attention.
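The comparison-and-sort step described above can be sketched as follows. This is only an illustration under stated assumptions, not the author's implementation: the abstract does not say how the entropy is obtained from an embedding and a chunk, nor whether lower or higher entropy is preferred. The sketch assumes a hypothetical matrix `cross`, where `cross[i][j]` is the average Rényi α-entropy of the embedding trained on chunk `i` when evaluated on chunk `j`, and it assumes that lower is better. The names `renyi_entropy`, `average_renyi_entropy`, and `bubble_sort_chunks` are likewise illustrative.

```python
import math
from typing import Callable, List, Sequence


def renyi_entropy(p: Sequence[float], alpha: float) -> float:
    """Renyi alpha-entropy (in nats) of a discrete distribution p.

    Defined as log(sum_i p_i**alpha) / (1 - alpha) for alpha > 0, alpha != 1.
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("alpha must be positive and different from 1")
    return math.log(sum(q ** alpha for q in p if q > 0)) / (1.0 - alpha)


def average_renyi_entropy(dists: Sequence[Sequence[float]], alpha: float) -> float:
    """Mean Renyi alpha-entropy over a collection of distributions."""
    return sum(renyi_entropy(p, alpha) for p in dists) / len(dists)


def bubble_sort_chunks(n: int, beats: Callable[[int, int], bool]) -> List[int]:
    """Order chunk indices 0..n-1 with a bubble sort driven by a pairwise test.

    beats(i, j) should return True when embedding i is preferred over
    embedding j. The first index in the result is the preferred chunk.
    """
    order = list(range(n))
    for end in range(n - 1, 0, -1):
        swapped = False
        for k in range(end):
            a, b = order[k], order[k + 1]
            if beats(b, a):  # move the preferred embedding toward the front
                order[k], order[k + 1] = b, a
                swapped = True
        if not swapped:
            break
    return order


# Hypothetical cross-evaluation scores (assumption: lower entropy is better).
cross = [
    [0.0, 2.0, 1.5],
    [1.0, 0.0, 0.5],
    [1.2, 0.9, 0.0],
]


def beats(i: int, j: int) -> bool:
    # Embedding i wins if it scores lower on chunk j than j scores on chunk i.
    return cross[i][j] < cross[j][i]


ranking = bubble_sort_chunks(len(cross), beats)
best_chunk = ranking[0]
```

Because a pairwise comparison of this kind need not be transitive, the ordering a bubble sort produces can depend on the initial order of the chunks; the sketch makes no attempt to detect or resolve such cycles.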
Gary Nan Tie (Thu,) studied this question.