What question did this study set out to answer?

The aim is to enhance the efficiency of both segmented and non-segmented reduction operations on GPUs.

April 22, 2026Open Access

Optimized Parallel Reduction for Regular and Irregular Segments on GPU

Key Points

The aim is to enhance the efficiency of both segmented and non-segmented reduction operations on GPUs.
Developed six strategies for handling segments of different sizes.
Created a tuning algorithm for optimizing performance across various GPU architectures.
Conducted experiments on diverse GPU architectures to validate effectiveness.
Achieved over a 135× speedup for many small segments compared to existing algorithms.
Obtained up to a 98× speedup for fewer large segments based on the segment distribution.
Improved non-segmented reduction by up to 1.72 times over other GPU implementations.

Abstract

ABSTRACT Reduction is an operation that combines all the elements of a collection by applying a binary operation, such as sum, maximum, or minimum, to all the elements to obtain a single resulting value. This paper investigates implementation strategies for both segmented and non‐segmented reduction on GPUs. Existing techniques for segmented reduction often show consistent performance but are relatively inefficient in absolute terms. These techniques are frequently optimized for specific workloads and, as a result, may exhibit degraded performance with certain input data, especially when segments have varying sizes. The algorithm presented in this paper employs six different strategies for handling segments of varying sizes and automatically selects the most suitable one at runtime based on segment size. To ensure portability and consistent performance across GPU architectures, we developed a tuning algorithm that automatically optimizes for current or future GPUs. Experiments conducted on GPUs with diverse architectures demonstrate the effectiveness of this tuning approach and validate the algorithms' consistent performance. Overall, the proposed method achieves over a 135× speedup when processing a large number of small segments, and up to a 98× speedup for a small number of large segments, depending on the distribution, compared to segmented reduction algorithms in state‐of‐the‐art GPU libraries. Additionally, this paper also explores strategies to accelerate non‐segmented reduction, resulting in up to 1.72 times improvement compared to other parallel GPU implementations.

Optimized Parallel Reduction for Regular and Irregular Segments on GPU

Key Points

Abstract

Cite This Study