An Experimental Study of a Biosequence Big Data Analysis Service

Document Type

Conference Proceeding

Publication Date

9-7-2017

Abstract

With the development of next-generation sequencing (NGS), DNA/RNA sequencing has become cheaper and more efficient. Today, a whole human genome can be sequenced for under $1,000, providing opportunities for large-scale bioinformatic analysis on big datasets. However, most existing bioinformatic analysis tools are programmed for single-server computing platforms and are not suitable for processing such big datasets. As Hadoop MapReduce and Spark gain popularity as cluster-computing-based big data processing platforms, more and more bioinformatic applications are starting to explore cluster computing platforms for large-scale data analysis. In this paper, we present an in-depth experimental study on deploying Spark clusters for high-performance bioinformatic short sequence reconstruction. Our experimental results enable us to answer a number of challenging yet frequently asked questions regarding efficient management of bioinformatic data analysis services on Spark systems. Example questions include: How should a big dataset be split into multiple partitions, and how should data partitions and bioinformatic analysis tasks be distributed on a Spark cluster to carry out a high-performance distributed analysis job? What types of memory models are effective for bioinformatic data analysis services on a Spark cluster? Why do different bioinformatic data analysis operations exhibit different throughput performance on the same Spark cluster? We conjecture that this experimental study not only demonstrates the feasibility of high-performance bioinformatic data analysis on the Spark platform, but will also help bioinformatic application developers make more informed decisions on the design and configuration of Spark clusters and on managing and tuning the parameters of the Spark runtime system to enhance the performance of large-scale big data analytics.

Publication Source (Journal or Book title)

Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017

First Page

237

Last Page

244

