Generative AI (GenAI) is one of the most critical applications today, continually challenging the limits of semiconductor technology. We introduce a very fine-grained 3D memory-on-logic architecture, along with a novel data mapping strategy, to support Large Language Model (LLM)-based GenAI in both the prefill and generation stages. Our conceptual analysis shows how ultradense 3D connectivity can improve text generation speed and energy efficiency well beyond current limits. Preliminary findings from a basic analytical model indicate that the single-batch autoregressive generation rate for Llama 3.2 1B could surpass 5K tokens/sec. The gain comes from maximizing weight locality and from raising memory bandwidth through massively parallel 3D links between the Multiply-Accumulate (MAC) units in the logic tier and their dedicated memory partitions in the 3D stack. We also explore the impact of advanced logic nodes and quantify their benefits in reducing prefill latency. Finally, we examine the memory access power and power density challenges that arise under extreme bandwidth conditions and present pipelined access strategies to address them.
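To give a feel for why the generation rate quoted in the abstract depends so heavily on memory bandwidth, the following is a minimal back-of-the-envelope sketch. It is not the paper's analytical model. It assumes that single-batch decoding reads every weight once per generated token and that weight traffic is the only bottleneck (KV-cache traffic, activations, and compute time are ignored). The parameter count (about 1.24 billion for Llama 3.2 1B) and the 8-bit weight precision are assumptions made for illustration, and the function names are hypothetical.

```python
# Bandwidth-bound estimate for single-batch autoregressive decoding.
# Assumption: each generated token requires streaming all weights once,
# and nothing else (KV cache, activations, compute) limits throughput.

N_PARAMS = 1.24e9        # assumed approximate parameter count of Llama 3.2 1B
BYTES_PER_WEIGHT = 1.0   # assumed 8-bit weights; not stated in the abstract


def required_bandwidth(n_params, bytes_per_weight, tokens_per_s):
    """Weight-read bandwidth (bytes/s) needed to sustain a token rate."""
    return n_params * bytes_per_weight * tokens_per_s


def bandwidth_bound_rate(n_params, bytes_per_weight, bandwidth_bytes_per_s):
    """Upper bound on tokens/s when weight reads are the only bottleneck."""
    return bandwidth_bytes_per_s / (n_params * bytes_per_weight)


if __name__ == "__main__":
    target = 5_000  # tokens/sec figure quoted in the abstract
    bw = required_bandwidth(N_PARAMS, BYTES_PER_WEIGHT, target)
    print(f"Bandwidth needed for {target} tokens/s: {bw / 1e12:.2f} TB/s")
```

Under these assumptions, sustaining 5K tokens/sec requires reading roughly 6.2 TB/s of weights, which suggests why the architecture relies on massively parallel links between MAC units and their local memory partitions.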
Kerem Akarvardar
Xiaoyu Sun
Brian Crafton
ACM Transactions on Design Automation of Electronic Systems
Stanford University
Taiwan Semiconductor Manufacturing Company (Taiwan)
Taiwan Semiconductor Manufacturing Company (United States)
www.synapsesocial.com/papers/68dc1e308a7d58c25ebb1542 — DOI: https://doi.org/10.1145/3768168