This paper proposes a bandwidth–compute coupled memory allocation and management strategy for high-dimensional vector retrieval on NUMA–CXL hybrid memory platforms. Based on empirical characterization of local DRAM, remote DRAM, and CXL nodes, a three-tier memory access cost model is established to guide a two-level optimization framework. The proposed approach integrates (i) bandwidth-aware data partitioning that assigns contiguous vector segments to each memory node in proportion to its measured effective bandwidth, and (ii) NUMA-aware, compute-coupled thread scheduling that co-locates execution with the corresponding data segment. A double-buffered CXL prefetch pipeline further reduces the impact of the low-bandwidth and high-latency CXL path by staging upcoming blocks into DRAM. Experiments on the FAISS-Flat index using the SIFT1M and GloVe datasets demonstrate up to 12× performance improvement under high concurrency, with bandwidth-weighted segmentation and thread affinity contributing 11×, and CXL prefetching reducing remote access overhead by about 10%.
She et al. (Thu,) studied this question.