Doctor of Philosophy (PhD)
Department of Computer Science
Genome sequencing technology has witnessed tremendous progress in terms of throughput and cost per base pair, resulting in an explosion in the size of data. Typical de Bruijn graph-based assembly tools demand a lot of processing power and memory and cannot assemble big datasets unless running on a scaled-up server with terabytes of RAMs or scaled-out cluster with several dozens of nodes. In the first part of this work, we present a distributed next-generation sequence (NGS) assembler called Lazer, that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing the memory-to-disk swapping and reducing the network communication in the cluster, we can assemble large sequences such as human genomes (~400 GB) on just two nodes in 14.5 hours, and also scale up to 128 nodes in 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours. In the second part, we present a new distributed GPU-accelerated NGS assembler called LaSAGNA, which can assemble large-scale sequence datasets using a single GPU by building string graphs from approximate all-pair overlaps in quasi-linear time. To use the limited memory on GPUs efficiently, LaSAGNA uses a two-level semi-streaming approach from disk through host memory to device memory with restricted access patterns on both disk and host memory. Using LaSAGNA, we can assemble the human genome dataset on a single NVIDIA K40 GPU in 17 hours, and in a little over 5 hours on an 8-node cluster of NVIDIA K20s. In the third part, we present the first distributed 3rd generation sequence (3GS) assembler which uses a map-reduce computing paradigm and a distributed hash-map, both built on a high-performance networking middleware. Using this assembler, we assembled an Oxford Nanopore human genome dataset (~150 GB) in just over half an hour using 128 nodes whereas existing 3GS assemblers could not assemble it because of memory and/or time limitations.
Goswami, Sayan, "High-Performance Computing Frameworks for Large-Scale Genome Assembly" (2019). LSU Doctoral Dissertations. 4942.