The H-P sort is a known algorithm for integer sorting using histogram computation and prefix sum calculation. In this paper, first, by implementing a naive GPU version of H-P sort, we demonstrate that it outperforms the fastest standard library (CUB sort) when the input data range is small compared to the input size. Next, by improving the histogram and prefix-sum algorithms, the optimized algorithm becomes faster than the naive version and, in some aspects where the H-P sort had previously been inferior to the CUB sort, it now outperforms.
TAKASE et al. (Thu,) studied this question.