Doctor of Philosophy (PhD)
Electrical and Computer Engineering
As the complexity of current and future large-scale data and exascale system architectures grows, so do the challenges that industry and the research community face in productivity, portability, software scalability, and efficient utilization of system resources. Software solutions and applications are expected to scale in performance on such complex systems. Asynchronous many-task (AMT) systems show promise in addressing these challenges. They exploit multi-core architectures through lightweight threads, asynchronous execution, and smart scheduling.
In this research, we implement several scalable and distributed applications based on HPX, an exemplar AMT runtime system. First, we introduce a distributed HPX implementation of Task Bench, a parameterized benchmark. We analyze its performance bottlenecks: the cost of repeatedly creating HPX threads and a global barrier across all threads. We then present methods that keep spawned threads alive and overlap communication with computation. Evaluation results demonstrate the effectiveness of the improved approach. HPX performs comparably to prevalent programming models and takes advantage of multi-task scenarios. Second, we introduce HPX support for SHAD, an algorithms and data-structures library. We develop methods to support local and remote operations in both synchronous and asynchronous modes, and we provide an HPX implementation that backs the SHAD library. Performance results show that the proposed system performs similarly to SHAD with Intel TBB (Threading Building Blocks) support for shared-memory parallelism. It also exploits distributed-memory parallelism better than SHAD with GMT (Global Memory and Threading) support. Third, we introduce Phylanx, an asynchronous array processing framework. We develop methods that support a distributed alternating least squares (ALS) algorithm and provide an implementation of this algorithm along with a number of distributed primitives. Performance results show that the Phylanx implementation scales well. Finally, we introduce a scalable second-order optimization method and provide an implementation of a Krylov-Newton method using the PyTorch framework. Evaluation results demonstrate the scalability and convergence of the proposed method, as well as its robustness to hyper-parameters.
Wu, Nanmiao, "Performance Analysis and Improvement for Scalable and Distributed Applications Based on Asynchronous Many-Task Systems" (2022). LSU Doctoral Dissertations. 5781.