Faculty Publications

Large-scale parallel genome assembler over cloud computing environment

Arghya Kusum Das, School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, 340 East Parker Blvd, Baton Rouge, Louisiana 70803, USA.
Praveen Kumar Koppa, School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, 340 East Parker Blvd, Baton Rouge, Louisiana 70803, USA.
Sayan Goswami, School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, 340 East Parker Blvd, Baton Rouge, Louisiana 70803, USA.
Richard Platania, School of Electrical Engineering and Computer Science, Center for Computation and Technology, Louisiana State University, 340 East Parker Blvd, Baton Rouge, Louisiana 70803, USA.
Seung-Jong Park, Geriatric Department, The Affiliated Guangji Hospital of Soochow University, Suzhou, Jiang Su Province, People's Republic of China, 215137.
Daniel R. Kapusta, Department of Pharmacology, Louisiana State University Health Sciences Center New Orleans, New Orleans, LA, 70112, USA.
Yongliang Lv, Geriatric Department, The Affiliated Guangji Hospital of Soochow University, Suzhou, Jiang Su Province, People's Republic of China, 215137. lylv@sohu.com.
Juan Gao, Department of Pharmacology, Louisiana State University Health Sciences Center New Orleans, New Orleans, LA, 70112, USA. jgao1@lsuhsc.edu.

Document Type

Article

Publication Date

6-1-2017

Abstract

The size of high-throughput DNA sequencing data has already reached the terabyte scale. To manage this volume of data, many downstream sequencing applications have started using locality-based computing over different cloud infrastructures to take advantage of elastic (pay-as-you-go) resources at a lower cost. However, the locality-based programming model (e.g. MapReduce) is relatively new. Consequently, both developing scalable data-intensive bioinformatics applications with this model and understanding the hardware environment these applications need for good performance require further research. In this paper, we present a de Bruijn graph-oriented Parallel Giraph-based Genome Assembler (GiGA), as well as the hardware platform required for its optimal performance. GiGA uses Hadoop (MapReduce) and Giraph (large-scale graph analysis) to achieve high scalability over hundreds of compute nodes by collocating computation and data. Compared to contemporary parallel assemblers (e.g. ABySS and Contrail) on a traditional HPC cluster, GiGA achieves significantly higher scalability with competitive assembly quality. Moreover, we show that the performance of GiGA improves significantly on an SSD-based private cloud infrastructure relative to a traditional HPC cluster. We observe that the performance of GiGA on 256 cores of this SSD-based cloud infrastructure closely matches its performance on 512 cores of the traditional HPC cluster.

Publication Source (Journal or Book title)

Journal of bioinformatics and computational biology

First Page

1740003

Recommended Citation

Das, A. K., Koppa, P. K., Goswami, S., Platania, R., Park, S., Kapusta, D. R., Lv, Y., & Gao, J. (2017). Large-scale parallel genome assembler over cloud computing environment. Journal of bioinformatics and computational biology, 15(3), 1740003. https://doi.org/10.1142/S0219720017400030

This document is currently not available here.
