Key points are not available for this paper at this time.
A task array is a 2-dimensional array of tasks with dependency relations. Each task uses the resulting values of some tasks in the left columns, and so it can be started only after these left tasks are completed. Conventional CUDA implementations repeatedly perform a separated CUDA kernel call for each column from left to right to synchronize the computation for tasks. However, this conventional CUDA implementation has several drawbacks: a CUDA kernel call has a certain overhead, and the running time of a CUDA kernel is determined by a CUDA block that terminates lastly. Also, every task must write and preserve the resulting values in the global memory with low memory access performance for the following tasks. The main contribution of this paper is to introduce task arrays and to present Single Kernel Soft Synchronization (SKSS) technique that significantly reduces such overheads for task arrays. The SKSS performs only one CUDA kernel call and CUDA blocks assigned to each row of a task array using a global counter. To clarify the potentiality of our SKSS technique, we have implemented the dynamic programming for the 0-1 knapsack problem, the summed area table computation, and the error diffusion of a gray-scale image using our SKSS technique and compared with previously published best GPU implementations. Quite surprisingly, the experimental results using NVIDIA Titan X show that, our SKSS implementations are 1.29-2.11 times faster for the 0-1 knapsack problem, 1.08-1.56 times faster for the summed area table computation, and 1.61-2.11 times faster for the error diffusion.
Funasaka et al. (Wed,) studied this question.