What type of study is this?

This is a quantitative study.

September 30, 2025

Open Access

Ultrafast Generative AI by Ultradense 3D Integration: A Case Study on LLM-based Edge Inference

Kerem Akarvardar, Taiwan Semiconductor Manufacturing Company (United States)
Xiaoyu Sun, Taiwan Semiconductor Manufacturing Company (United States)
Brian Crafton

Key Points

Projects, from a basic analytical model, a single-batch autoregressive generation rate above 5K tokens/sec for Llama 3.2 1B on an ultradense 3D memory-on-logic architecture.
Implements a novel data mapping strategy that significantly enhances memory bandwidth and weight locality.
Examines the challenges of memory access power and power density under extreme bandwidth conditions, and quantifies the prefill latency reductions offered by advanced logic nodes.
Presents new pipelined access strategies to optimize performance under extreme bandwidth conditions.

Abstract

Generative AI (GenAI) is one of the most critical applications today, continually challenging the limits of semiconductor technology. We introduce a very fine-grained 3D memory-on-logic architecture along with a novel data mapping strategy to support Large Language Model (LLM)-based GenAI, including both prefill and generation stages. Our conceptual analysis shows how ultradense 3D connectivity can enhance text generation speed and energy efficiency well beyond current limits. Preliminary findings from a basic analytical model indicate that the single-batch autoregressive generation rate for Llama 3.2 1B could surpass 5K tokens/sec by maximizing weight locality and enhancing memory bandwidth through massively parallel 3D links between Multiply-Accumulate (MAC) units in the logic tier and their dedicated memory partitions in the 3D stack. We also explore the impact of advanced logic nodes and quantify their benefits in reducing prefill latency. Finally, we examine the challenges associated with memory access power and power density under extreme bandwidth conditions and present pipelined access strategies to address them.
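To see why memory bandwidth dominates the single-batch generation rate, a back-of-the-envelope bound can help. The sketch below is not the authors' analytical model. It assumes only that each generated token requires every weight to be read from memory once, and it ignores compute time, KV-cache traffic, and activations. The parameter count (about 1.2 billion) and the 8-bit weight precision are illustrative assumptions, not figures taken from the abstract, and the function names are hypothetical.

```python
# Minimal bandwidth-bound estimate for single-batch autoregressive decoding.
# Assumption: every weight is fetched exactly once per generated token, and
# weight traffic is the only bottleneck (no compute, KV-cache, or activation cost).

def bandwidth_bound_tokens_per_sec(num_params, bytes_per_weight, bandwidth_bytes_per_sec):
    """Upper bound on tokens/sec when weight reads saturate memory bandwidth."""
    bytes_per_token = num_params * bytes_per_weight
    return bandwidth_bytes_per_sec / bytes_per_token


def required_bandwidth_bytes_per_sec(num_params, bytes_per_weight, target_tokens_per_sec):
    """Memory bandwidth needed to sustain a target tokens/sec under the same assumption."""
    return num_params * bytes_per_weight * target_tokens_per_sec


# Illustrative inputs (assumptions, not values reported in the paper).
ASSUMED_NUM_PARAMS = 1.2e9        # rough size of a ~1B-parameter model
ASSUMED_BYTES_PER_WEIGHT = 1.0    # 8-bit weights
TARGET_TOKENS_PER_SEC = 5_000     # the rate quoted in the abstract

needed = required_bandwidth_bytes_per_sec(
    ASSUMED_NUM_PARAMS, ASSUMED_BYTES_PER_WEIGHT, TARGET_TOKENS_PER_SEC
)
print(f"Required weight bandwidth: {needed / 1e12:.1f} TB/s")
```

Under these assumed inputs the bound works out to roughly 6 TB/s of sustained weight bandwidth, which illustrates why the abstract emphasizes weight locality and massively parallel links between MAC units and their dedicated memory partitions.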
