Sparse data structures are ubiquitous in graph analytics, machine learning, and high-performance computing. Algorithms operating on these structures typically exhibit highly irregular data-dependent memory access (DDMA) patterns, leading to frequent cache misses and degraded memory performance. Prior work on hardware prefetching to mitigate DDMA-induced misses falls into two categories: address-based methods that sample correlated sequences of load data and addresses from cache miss streams, and instruction-based methods that record instruction-level dependency chains. Although both learn single relations effectively, they struggle with multi-level range relations prevalent in DDMA-intensive workloads, leaving substantial prefetching opportunities unexploited. In address-based schemes, misses from deeper-level consumers are often miscorrelated with the producer. Moreover, out-of-order execution and the range relations themselves perturb sampling, yielding mismatched load instances. In instruction-based schemes, chain-structured representations and suboptimal learning strategies prevent the construction of complete dependency chains for these relations. To overcome these limitations, we present Thoth, a hardware prefetcher that operates at the granularity of explicit producer-consumer load pairs rather than constructing dependency chains. Thoth detects such pairs via register-level dependency tracking. It adopts an annotation-directed load sampling strategy that annotates matched producer-consumer load instances and samples only those annotated instances, thereby robustly uncovering DDMA patterns—including multi-level range relations—while avoiding mismatches. To maintain annotation correctness across pipeline flushes, Thoth employs precise load annotation, which leverages reorder identifiers to resume or terminate annotation precisely. 
On a suite of DDMA-intensive benchmarks, Thoth delivers a 51.1% speedup over a no-prefetching baseline and outperforms two state-of-the-art DDMA prefetchers by 14.7% and 8.2%, respectively.
Jiang et al. (Fri,) studied this question.
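To make the term "multi-level range relation" concrete, the following minimal sketch walks a graph in compressed sparse row (CSR) form and records which load produced the address of each later load. It illustrates only the access pattern, not Thoth's hardware. The arrays `offsets`, `neighbors`, and `values`, the function `traverse`, and the recorded pair format are hypothetical names chosen for this illustration.

```python
from typing import List, Tuple

# Hypothetical CSR graph (illustrative data, not from the paper):
# vertex v owns the edge slots offsets[v] .. offsets[v + 1] - 1.
offsets: List[int] = [0, 2, 3, 5]
neighbors: List[int] = [1, 2, 2, 0, 1]
values: List[int] = [10, 20, 30]

Load = Tuple[str, int]      # (array name, index that was loaded)
Pair = Tuple[Load, Load]    # (producer load, consumer load)


def traverse(frontier: List[int]) -> Tuple[int, List[Pair]]:
    """Sum neighbor values and log producer-consumer load pairs."""
    pairs: List[Pair] = []
    total = 0
    for v in frontier:
        # Producer: the data loaded from offsets bounds a range of addresses.
        begin, end = offsets[v], offsets[v + 1]
        for e in range(begin, end):
            # Range relation: one producer feeds many neighbors[e] loads.
            u = neighbors[e]
            pairs.append((("offsets", v), ("neighbors", e)))
            # Next level, single relation: the data of neighbors[e]
            # is the index of the values load.
            total += values[u]
            pairs.append((("neighbors", e), ("values", u)))
    return total, pairs


total, pairs = traverse([0, 2])
print(total)
for producer, consumer in pairs:
    print(producer, "->", consumer)
```

Each `offsets` load fans out to a variable number of `neighbors` loads, and each of those feeds one `values` load. A prefetcher that only sees the miss stream has to work out which `values` miss belongs to which `offsets` load, whereas the explicit pairs logged here keep each consumer attached to its direct producer.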