Degree

Doctor of Philosophy (PhD)

Department

Electrical and Computer Engineering

Document Type

Dissertation

Abstract

Over the past decade, high-performance deep learning has evolved into a critical research domain, driven by the demand for efficient models and high inference throughput. Deep learning architectures have shifted from stacked convolutional layers to transformer-based models, while pruning techniques and graph-structured data have established sparse matrix–dense matrix multiplication (SpMM) as a fundamental kernel—particularly in graph neural networks (GNNs). Modern GPUs, with their massive parallelism and high-bandwidth memory, offer immense potential for accelerating these workloads. While SpMM implementations using the compressed sparse row (CSR) format remain common to avoid conversion overhead, preprocessing-based methods have recently demonstrated superior potential. In GNN contexts, this preprocessing cost is amortized effectively across all layers and training iterations.
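To make the SpMM kernel concrete, the sketch below computes Y = A·X where A is stored in the standard CSR format (row pointers, column indices, values) and X is dense. This is a generic illustration of the kernel the abstract refers to, not code from the dissertation; all names are illustrative.

```python
def spmm_csr(row_ptr, col_idx, vals, X):
    """Multiply a CSR sparse matrix A = (row_ptr, col_idx, vals) by a dense matrix X."""
    n_rows = len(row_ptr) - 1
    n_cols = len(X[0])
    Y = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):                          # one output row per sparse row
        for p in range(row_ptr[i], row_ptr[i + 1]):  # non-zeros of row i
            a, j = vals[p], col_idx[p]
            for k in range(n_cols):                  # accumulate a * X[j, :] into Y[i, :]
                Y[i][k] += a * X[j][k]
    return Y

# Example: A = [[2, 0], [0, 3]] in CSR form, X = all-ones 2x2 matrix
row_ptr, col_idx, vals = [0, 1, 2], [0, 1], [2.0, 3.0]
print(spmm_csr(row_ptr, col_idx, vals, [[1.0, 1.0], [1.0, 1.0]]))
# → [[2.0, 2.0], [3.0, 3.0]]
```

The irregular inner loop over `col_idx` is what produces the scattered accesses into X that the dissertation's tiling and scheduling techniques target.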

Despite these advantages, optimizing preprocessed SpMM on GPUs remains a significant challenge. The primary obstacle is achieving high data reuse across Cooperative Thread Arrays (CTAs) while maintaining workload balance across Streaming Multiprocessors (SMs). Real-world sparse datasets typically exhibit random node connections, resulting in highly irregular memory access patterns. Although tiling can improve data locality, its efficacy is limited by two factors. First, intra-tile reuse is constrained by inherent sparsity: sparser connections necessitate larger tile heights, which increases memory footprint and induces cache thrashing. Second, conventional static tiling strategies fail to exploit inter-CTA data reuse in the L1 cache due to unpredictable CTA scheduling. Furthermore, the irregular structure of sparse matrices causes substantial workload disparity among tiles, degrading performance via inter-SM load imbalance.

To address these issues, we propose a novel cache-conscious scheduling framework for SpMM on GPUs. This framework integrates two core strategies: non-uniform tiling, which prioritizes partitioning dense non-zero clusters to preserve L1 locality, and wing enablement, a technique designed to maximize inter-block data reuse. Coordinated by our scheduling mechanism, these strategies synergistically enhance L1 reuse while ensuring load balancing. This approach effectively resolves the critical trade-off between cache locality and inter-SM workload distribution. Evaluated on GNN-derived sparse matrices using an NVIDIA H100 GPU, our method delivers significant performance improvements over the highly optimized cuSPARSE library.

Date

12-22-2025

Committee Chair

Koppelman, David
