Polymer-augmented waterflooding is a key enhanced oil recovery technique whose simulation remains computationally demanding at high spatial resolution. This paper presents a fully GPU-resident parallel solver for the one-dimensional Buckley–Leverett polymer-flooding problem within an Implicit-Pressure–Explicit-Saturation (IMPES) framework. The solver combines Jacobi iteration for pressure, first-order upwind flux splitting for saturation, and a first-order upwind flux-splitting update for polymer mass with explicit concentration recovery, all inside a coupled Picard–IMPES iteration. Two CUDA implementations are compared: a global-memory baseline and a shared-memory variant that stages a per-block pressure tile with halo cells on chip. Both implementations were profiled on an NVIDIA GeForce RTX 2080 Ti over problem sizes from N = 65,536 to N = 67,108,864 and block sizes of 128, 256, 512, and 1024. The two GPU implementations match the serial reference to within 2 × 10⁻⁸, and peak speed-ups are 20.2× (global) and 20.1× (shared). Per-kernel Nsight Compute profiling classifies every kernel in both builds as compute-bound: SM throughput is 54–83% of peak, while DRAM throughput is 3–29% of peak. The bottleneck is the FP64 pipeline of consumer Turing hardware, where FP64 throughput is one thirty-second of FP32 throughput; three FP64 divisions per cell, arising from inline recomputation of the polymer-modified mobility, saturate the FP64 unit. Shared-memory tiling cannot improve performance because it acts on memory traffic rather than on compute throughput. The result therefore characterizes a specific regime, namely one-dimensional, low-reuse FP64 transport stencils on consumer-class NVIDIA GPUs with reduced FP64 throughput, and is not a universal property of CUDA shared memory.
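The explicit transport step described above (first-order upwind updates for saturation and polymer mass, followed by concentration recovery) can be illustrated with a minimal serial sketch in Python. It is not the paper's solver. The quadratic relative permeabilities, the linear viscosity model `mu_w0 * (1 + alpha * c)`, all parameter values, and the names `fractional_flow`, `upwind_step`, and `simulate` are assumptions made for illustration only. The pressure solve and the Picard iteration are omitted: the sketch assumes a constant total velocity directed from left to right.

```python
import numpy as np

def fractional_flow(s, c, mu_w0=1.0, mu_o=4.0, alpha=2.0):
    """Water fractional flow; polymer raises water viscosity (assumed model)."""
    mu_w = mu_w0 * (1.0 + alpha * c)      # hypothetical linear thickening law
    lam_w = s**2 / mu_w                   # water mobility, quadratic rel-perm
    lam_o = (1.0 - s)**2 / mu_o           # oil mobility, quadratic rel-perm
    return lam_w / (lam_w + lam_o)

def upwind_step(s, c, dt_dx, s_inj=1.0, c_inj=1.0, eps=1e-12):
    """One explicit upwind update of saturation s and polymer mass m = s * c."""
    f = fractional_flow(s, c)
    f_in = fractional_flow(s_inj, c_inj)  # flux at the injection boundary
    # Flow is left to right, so each cell's inflow is its left neighbour's flux.
    f_left = np.concatenate(([f_in], f[:-1]))
    cf = c * f
    cf_left = np.concatenate(([c_inj * f_in], cf[:-1]))
    s_new = s - dt_dx * (f - f_left)
    m_new = s * c - dt_dx * (cf - cf_left)
    # Explicit concentration recovery, guarded where no water is present.
    c_new = np.where(s_new > eps, m_new / np.maximum(s_new, eps), 0.0)
    return s_new, c_new

def simulate(n_cells=100, n_steps=50, dt_dx=0.2):
    """Inject polymer solution into a domain initially free of water."""
    s = np.zeros(n_cells)
    c = np.zeros(n_cells)
    for _ in range(n_steps):
        s, c = upwind_step(s, c, dt_dx)
    return s, c

if __name__ == "__main__":
    s, c = simulate()
    print("injected water volume (cell units):", s.sum())
    print("injected polymer mass (cell units):", (s * c).sum())
```

Under these assumed parameters, `dt_dx = 0.2` keeps the scheme within its CFL limit, and the conservative form means the total water and polymer content grow by exactly the injected flux per step until the front reaches the outlet. Note that `fractional_flow` contains per-cell divisions of the kind the abstract identifies as the FP64 bottleneck on the GPU.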
Makhmut et al. (Sat,) studied this question.