What question did this study set out to answer?

This study aims to compare the performance of global-memory and shared-memory kernels in simulating the Buckley-Leverett polymer-flooding problem using CUDA.

June 4, 2026 · Open Access

A CUDA Performance Study of Global- and Shared-Memory Kernels for the Buckley–Leverett Polymer-Flooding Problem

Key Points

Developed a GPU-resident parallel solver using Jacobi iteration and flux splitting within an Implicit-Pressure-Explicit-Saturation framework.
Profiled two CUDA implementations (global vs. shared memory) on an NVIDIA GeForce RTX 2080 Ti for varying problem and block sizes.
Profiled each kernel with Nsight Compute to classify it as compute-bound or memory-bound.
Both CUDA implementations matched the serial reference results to within 2 × 10⁻⁸.
Observed peak speed-ups of 20.2× for global-memory and 20.1× for shared-memory implementations.
Identified FP64 throughput as the bottleneck; shared-memory tiling did not improve performance because it reduces memory traffic, not compute load.
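
A rough roofline estimate can illustrate why a kernel on this class of GPU may be limited by FP64 arithmetic rather than by memory traffic. The sketch below is only an illustration: the hardware figures are nominal RTX 2080 Ti spec-sheet values assumed here (they are not measurements from the study), and the per-cell operation and byte counts are hypothetical placeholders.

```python
# Nominal RTX 2080 Ti figures, assumed from public spec sheets (not from the study).
FP32_PEAK_FLOPS = 13.45e12   # FP32 FLOP/s at boost clock
FP64_RATIO = 1 / 32          # FP64:FP32 throughput ratio on consumer Turing
DRAM_BANDWIDTH = 616e9       # bytes/s


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth


def bound_by(flops_per_cell, bytes_per_cell, peak_flops, bandwidth):
    """Classify a kernel with a simple roofline model."""
    intensity = flops_per_cell / bytes_per_cell
    if intensity > ridge_point(peak_flops, bandwidth):
        return "compute"
    return "memory"


fp64_peak = FP32_PEAK_FLOPS * FP64_RATIO
fp64_ridge = ridge_point(fp64_peak, DRAM_BANDWIDTH)
fp32_ridge = ridge_point(FP32_PEAK_FLOPS, DRAM_BANDWIDTH)

# Hypothetical per-cell cost: 40 floating-point operations against
# 5 doubles (40 bytes) of DRAM traffic. These are placeholders, not
# counts taken from the study's kernels.
HYPOTHETICAL_FLOPS_PER_CELL = 40
HYPOTHETICAL_BYTES_PER_CELL = 5 * 8

fp64_verdict = bound_by(HYPOTHETICAL_FLOPS_PER_CELL, HYPOTHETICAL_BYTES_PER_CELL,
                        fp64_peak, DRAM_BANDWIDTH)
fp32_verdict = bound_by(HYPOTHETICAL_FLOPS_PER_CELL, HYPOTHETICAL_BYTES_PER_CELL,
                        FP32_PEAK_FLOPS, DRAM_BANDWIDTH)

print(f"FP64 ridge point: {fp64_ridge:.2f} FLOP/byte -> {fp64_verdict}-bound")
print(f"FP32 ridge point: {fp32_ridge:.2f} FLOP/byte -> {fp32_verdict}-bound")
```

Under these assumptions the FP64 ridge point sits below one FLOP per byte, so even a modest amount of double-precision arithmetic per cell puts a kernel on the compute side of the roofline, where reducing memory traffic does not help.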

Abstract

Polymer-augmented waterflooding is a key enhanced oil recovery technique whose simulation remains computationally demanding at high spatial resolution. This paper presents a fully GPU-resident parallel solver for the one-dimensional Buckley–Leverett polymer-flooding problem within an Implicit-Pressure–Explicit-Saturation (IMPES) framework. The solver combines Jacobi iteration for pressure, first-order upwind flux splitting for saturation, and a first-order upwind flux-splitting update for polymer mass with explicit concentration recovery inside a coupled Picard–IMPES iteration. Two CUDA implementations are compared: a global-memory baseline and a shared-memory variant that stages a per-block pressure tile with halo cells on chip. Both kernels were profiled on an NVIDIA GeForce RTX 2080 Ti over problem sizes from N = 65,536 to N = 67,108,864 and block sizes of 128, 256, 512, and 1024. The two GPU implementations match the serial reference to within 2 × 10⁻⁸, and peak speed-ups are 20.2× (global) and 20.1× (shared). Per-kernel Nsight Compute profiling classifies every kernel in both builds as compute-bound: SM throughput is 54–83% of peak and DRAM throughput is 3–29% of peak. The bottleneck is the FP64 pipeline of consumer Turing hardware (FP64 throughput is one thirty-second of FP32); three FP64 divisions per cell, from inline polymer-modified mobility recomputation, saturate the FP64 unit. Shared-memory tiling cannot improve performance because it acts on memory traffic rather than on compute throughput. The result therefore characterizes a specific regime, namely FP64 one-dimensional, low-reuse transport stencils on consumer-class NVIDIA GPUs with reduced FP64 throughput, and is not a universal property of CUDA shared memory.
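
To make the "per-block tile with halo cells" idea concrete, the following host-side Python sketch emulates the two access patterns for a single Jacobi sweep. It is an illustration under simplifying assumptions, not the study's CUDA code: the update is a constant-coefficient three-point stencil with a source term already scaled by the grid spacing squared (the actual pressure equation would carry mobility-weighted coefficients), and the function names `jacobi_sweep_direct` and `jacobi_sweep_tiled` are hypothetical.

```python
def jacobi_sweep_direct(p, rhs):
    """One Jacobi sweep reading neighbours straight from the full array
    (the analogue of a global-memory kernel)."""
    n = len(p)
    new = list(p)  # end values are left unchanged (fixed boundary values)
    for i in range(1, n - 1):
        new[i] = 0.5 * (p[i - 1] + p[i + 1] + rhs[i])
    return new


def jacobi_sweep_tiled(p, rhs, block_size):
    """Same sweep, but each block first copies its cells plus one halo cell
    per side into a local tile, mimicking a shared-memory staging step."""
    n = len(p)
    new = list(p)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        lo = max(start - 1, 0)  # left halo, clamped at the domain edge
        hi = min(stop + 1, n)   # right halo, clamped at the domain edge
        tile = p[lo:hi]         # stand-in for the on-chip tile
        for i in range(max(start, 1), min(stop, n - 1)):
            t = i - lo          # position of cell i inside the tile
            new[i] = 0.5 * (tile[t - 1] + tile[t + 1] + rhs[i])
    return new


# Small usage example with arbitrary data.
p0 = [float(i % 7) for i in range(20)]
rhs = [0.01] * 20
direct = jacobi_sweep_direct(p0, rhs)
tiled = jacobi_sweep_tiled(p0, rhs, block_size=8)
print(direct == tiled)
```

Both functions perform the same arithmetic per cell; only the source of the neighbour values differs. In a three-point stencil each staged value is read by just a couple of neighbouring cells, which is the low-reuse situation the abstract describes.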
