ABSTRACT Reduction is an operation that combines all the elements of a collection by applying a binary operation, such as sum, maximum, or minimum, to all the elements to obtain a single resulting value. This paper investigates implementation strategies for both segmented and non‐segmented reduction on GPUs. Existing techniques for segmented reduction often show consistent performance but are relatively inefficient in absolute terms. These techniques are frequently optimized for specific workloads and, as a result, may exhibit degraded performance with certain input data, especially when segments have varying sizes. The algorithm presented in this paper employs six different strategies for handling segments of varying sizes and automatically selects the most suitable one at runtime based on segment size. To ensure portability and consistent performance across GPU architectures, we developed a tuning algorithm that automatically optimizes for current or future GPUs. Experiments conducted on GPUs with diverse architectures demonstrate the effectiveness of this tuning approach and validate the algorithms' consistent performance. Overall, the proposed method achieves over a 135× speedup when processing a large number of small segments, and up to a 98× speedup for a small number of large segments, depending on the distribution, compared to segmented reduction algorithms in state‐of‐the‐art GPU libraries. Additionally, this paper also explores strategies to accelerate non‐segmented reduction, resulting in up to 1.72 times improvement compared to other parallel GPU implementations.
Cordeiro et al. (Wed,) studied this question.