What question did this study set out to answer?

The study asks how synchronous CUDA data transfers and kernel launches on the default stream can be automatically transformed into asynchronous calls on non-default streams while preserving data dependencies.

March 28, 2026

StreamAlloc: A Framework for Analyzing and Transforming CUDA Code to Enable Asynchronous Execution

Key Points

Aimed to speed up CUDA programs by converting synchronous data transfers and kernel launches into asynchronous ones while preserving data dependencies.
Proposed sync2async, a novel technique that turns synchronous calls into asynchronous calls on non-default streams and adds stream synchronizations where needed.
Developed StreamAlloc, a data-flow-analysis-based framework with four components: read-write analysis, Can-Run-Asynchronously (CRA) analysis, the Data Flow Stream Assignment (DFSA) algorithm, and a transformation framework.
Combined an inter-procedural compositional read-write analysis with an intra-procedural flow-sensitive CRA analysis to identify calls that can run asynchronously.
Implemented StreamAlloc using LLVM/Clang.
Achieved geometric mean speedups of 1.49x, 1.63x, and 2.02x over the baseline on P100, A4000, and A100 GPUs, respectively.
Automatically identified and transformed numerous synchronous calls into asynchronous ones, improving execution efficiency.

Abstract

In the CUDA programming model, data transfers on the default stream are synchronous, and, similarly, device kernels launched on the default stream cannot overlap with other kernel computations and data transfers. Overlapping execution can be enabled using asynchronous APIs and streams in CUDA. Using them, however, requires careful handling of data dependencies across multiple data-transfer calls, host operations, and kernel computations to ensure program correctness. Moreover, numerous data transfer calls and kernel calls in a program make it even more challenging to manually assign the appropriate stream identifier for each such call. This challenge remains daunting for non-expert programmers because they lack the right tools and expertise. To address this, we propose sync2async, a novel optimization technique that transforms synchronous data transfers and kernel launches into non-default-stream asynchronous calls by allocating stream identifiers (and adding stream synchronizations at appropriate places) to maximize parallelizability while preserving dependencies. To identify sync2async opportunities and apply transformations, we introduce StreamAlloc, a data-flow-analysis-based framework with four components: (1) inter-procedural compositional read-write analysis to identify variables read and written at call sites, (2) intra-procedural flow-sensitive Can-Run-Asynchronously (CRA) analysis to detect data-transfer and kernel calls that can run asynchronously, (3) Data Flow Stream Assignment (DFSA) algorithm to schedule such asynchronous calls to different non-default streams, and (4) a transformation framework to apply sync2async and automatically optimize the input program. We have implemented StreamAlloc using LLVM/Clang. On P100, A4000, and A100 GPUs, sync2async achieves geomean speedups of 1.49x, 1.63x, and 2.02x over the baseline, respectively.
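To make the idea of dependency-preserving stream allocation concrete, the following is a minimal toy sketch in Python. It is not the paper's DFSA algorithm or its CRA analysis, whose details are not given here. It assumes each operation already comes with the set of buffers it reads and writes (roughly what a read-write analysis would supply), and it uses a simple greedy rule invented for illustration. The names `Op`, `conflicts`, and `assign_streams`, and the example operations, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Op:
    """One data transfer or kernel launch with the buffers it reads and writes."""
    name: str
    reads: set
    writes: set


def conflicts(earlier, later):
    """True if `later` must wait for `earlier` (RAW, WAR, or WAW on a shared buffer)."""
    return bool(
        earlier.writes & later.reads
        or earlier.reads & later.writes
        or earlier.writes & later.writes
    )


def assign_streams(ops):
    """Toy greedy assignment (an illustrative assumption, not the paper's DFSA).

    An op with no conflicting predecessor gets a fresh stream. Otherwise it
    joins the stream of its most recent conflicting predecessor, because
    in-order execution within a stream already enforces that dependency.
    Dependencies on other streams are recorded as (op, predecessor) waits,
    one per foreign stream, which a real program would have to enforce with
    some form of stream synchronization.
    """
    stream_of = {}
    waits = []
    next_stream = 1  # 0 is treated here as the default stream and left unused
    for i, op in enumerate(ops):
        deps = [p for p in ops[:i] if conflicts(p, op)]
        if deps:
            stream_of[op.name] = stream_of[deps[-1].name]
        else:
            stream_of[op.name] = next_stream
            next_stream += 1
        # Waiting on the latest conflicting op of a stream covers earlier ones.
        latest_per_stream = {}
        for p in deps:
            latest_per_stream[stream_of[p.name]] = p.name
        for stream, pred in latest_per_stream.items():
            if stream != stream_of[op.name]:
                waits.append((op.name, pred))
    return stream_of, waits


# Hypothetical program: two independent copy+kernel chains, then a combining step.
ops = [
    Op("copy_a", reads={"h_a"}, writes={"d_a"}),
    Op("copy_b", reads={"h_b"}, writes={"d_b"}),
    Op("scale_a", reads={"d_a"}, writes={"d_a"}),
    Op("scale_b", reads={"d_b"}, writes={"d_b"}),
    Op("add", reads={"d_a", "d_b"}, writes={"d_c"}),
    Op("copy_c", reads={"d_c"}, writes={"h_c"}),
]

streams, waits = assign_streams(ops)
print(streams)
print(waits)
```

In this example the two independent chains (`copy_a`, `scale_a` and `copy_b`, `scale_b`) land on different streams and may overlap, while `add` joins one chain's stream and records a single wait on `scale_a` from the other. The sketch only shows why read and write sets are sufficient to decide which calls may overlap. A production framework would also have to account for host operations and control flow, which this toy ignores.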
