Split tiling for GPUs: Automatic parallelization using trapezoidal tiles
Document Type
Conference Proceeding
Publication Date
4-15-2013
Abstract
Tiling is a key technique to enhance data reuse. For computations structured as one sequential outer "time" loop enclosing a set of parallel inner loops, tiling only the parallel inner loops may not enable enough data reuse in the cache. Tiling the inner loops along with the outer time loop enhances data locality but may require other transformations, such as loop skewing, that inhibit inter-tile parallelism. One approach to tiling that enhances data locality without inhibiting inter-tile parallelism is split tiling, where tiles are subdivided into a sequence of trapezoidal computation steps. In this paper, we develop an approach to generate split-tiled code for GPUs in the PPCG polyhedral code generator. We propose a generic algorithm to compute an index-set splitting that enables us to perform tiling for locality and synchronization avoidance, while simultaneously maintaining parallelism, without the need for skewing or redundant computations. Our algorithm performs split tiling for an arbitrary number of dimensions and without the need to construct any large integer linear program. The method and its implementation are evaluated on standard stencil kernels and compared with a state-of-the-art polyhedral compiler and with a domain-specific stencil compiler, both targeting CUDA GPUs. Copyright 2013 ACM.
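The idea of subdividing tiles into trapezoidal steps can be illustrated on a small case. The following is a minimal sketch, assuming a one-dimensional 3-point Jacobi stencil with fixed boundary cells; it is not PPCG output and not the paper's index-set splitting algorithm. The function names, the tile width `W`, and the time-band height `H` are hypothetical choices made for illustration. Within each band of `H` time steps, phase 1 computes shrinking trapezoids that depend only on cells inside their own tile, so all tiles could run in parallel; phase 2 fills the growing wedges left between neighbouring tiles, which are again independent of each other. No skewing and no redundant computation is used.

```python
def jacobi_reference(u0, T):
    """Plain time-stepped 3-point Jacobi; the two boundary cells stay fixed."""
    u = list(u0)
    for _ in range(T):
        nxt = list(u)
        for i in range(1, len(u) - 1):
            nxt[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = nxt
    return u


def jacobi_split_tiled(u0, T, W=8, H=3):
    """Same computation, reordered with split tiling (illustrative sketch).

    W (tile width) and H (time-band height) are hypothetical parameters.
    """
    n = len(u0)
    assert W >= 2 * H, "tiles must be wide enough for trapezoids of height H"
    nan = float("nan")
    # lvl[t][i] is the value of cell i at time t. Keeping every time level is
    # wasteful, but reading a not-yet-computed cell would show up as NaN.
    lvl = [[nan] * n for _ in range(T + 1)]
    lvl[0] = list(u0)
    for t in range(T + 1):
        lvl[t][0], lvl[t][n - 1] = u0[0], u0[n - 1]

    def update(t, lo, hi):
        # Advance cells lo..hi (inclusive, clipped to the interior) from t to t+1.
        for i in range(max(lo, 1), min(hi, n - 2) + 1):
            lvl[t + 1][i] = (lvl[t][i - 1] + lvl[t][i] + lvl[t][i + 1]) / 3.0

    for t0 in range(0, T, H):
        h = min(H, T - t0)  # the last band may be shorter
        # Phase 1: one shrinking trapezoid per tile; tiles are mutually independent.
        for s in range(1, n - 1, W):
            for k in range(h):
                update(t0 + k, s + k, s + W - 1 - k)
        # Phase 2: one growing wedge per tile boundary; wedges are mutually independent.
        for b in range(1, n - 1 + W, W):
            for k in range(1, h):
                update(t0 + k, b - k, b + k - 1)
    return lvl[T]
```

In a GPU setting, the loop over `s` in phase 1 and the loop over `b` in phase 2 would each be candidates for mapping to independent thread blocks, with a synchronization point only between the two phases; that mapping is an assumption of this sketch rather than a description of the generated code.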
Publication Source (Journal or Book title)
ACM International Conference Proceeding Series
First Page
24
Last Page
31
Recommended Citation
Grosser, T., Cohen, A., Kelly, P., Ramanujam, J., Sadayappan, P., & Verdoolaege, S. (2013). Split tiling for GPUs: Automatic parallelization using trapezoidal tiles. ACM International Conference Proceeding Series, 24-31. https://doi.org/10.1145/2458523.2458526