Lazer: Distributed memory-efficient assembly of large-scale genomes

Document Type

Conference Proceeding

Publication Date

1-1-2016

Abstract

Genome sequencing technology has made tremendous progress in throughput and in cost per base pair, resulting in an explosion in data size. Consequently, typical sequence assembly tools demand substantial processing power and memory, and cannot assemble big datasets unless run on hundreds of nodes. In this paper, we present a distributed assembler that achieves both scalability and memory efficiency by using partitioned de Bruijn graphs. By enhancing memory-to-disk swapping and reducing network communication in the cluster, we can assemble large sequences such as human genomes (452 GB) on just two nodes in 14.5 hours, and scale up to 128 nodes, where assembly takes 23 minutes. We also assemble a synthetic wheat genome with 1.1 TB of raw reads on 8 nodes in 18.5 hours and on 128 nodes in 1.25 hours.
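
The general idea of a partitioned de Bruijn graph can be illustrated with a minimal sketch. This is a generic illustration, not the assembler's actual implementation: the function names (`kmers`, `owner`, `build_partitioned_graph`), the CRC32-based placement scheme, and the toy reads are all assumptions made for this example. Each k-mer contributes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and the edge is stored in whichever partition owns the prefix, so each partition holds only a slice of the graph.

```python
import zlib
from collections import defaultdict


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def owner(node, num_partitions):
    """Assign a (k-1)-mer to a partition with a deterministic hash.

    CRC32 is an assumption for this sketch; any stable hash would do.
    """
    return zlib.crc32(node.encode("ascii")) % num_partitions


def build_partitioned_graph(reads, k, num_partitions):
    """Return one adjacency map per partition.

    Each edge (prefix -> suffix) is stored in the partition that owns
    its prefix, so no node's outgoing edges are split across partitions.
    """
    partitions = [defaultdict(set) for _ in range(num_partitions)]
    for read in reads:
        for kmer in kmers(read, k):
            src, dst = kmer[:-1], kmer[1:]
            partitions[owner(src, num_partitions)][src].add(dst)
    return partitions


if __name__ == "__main__":
    # Hypothetical toy reads, chosen only to show the data layout.
    toy_reads = ["ACGTAC", "CGTACG"]
    for pid, graph in enumerate(build_partitioned_graph(toy_reads, k=4, num_partitions=2)):
        for src in sorted(graph):
            print(pid, src, sorted(graph[src]))
```

In a real cluster each partition would live on a different node, and following an edge whose destination belongs to another partition would require network communication, which is the kind of cost the abstract says the assembler reduces.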

Publication Source (Journal or Book title)

Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016

First Page

1171

Last Page

1181

