An Experimental Study of a Biosequence Big Data Analysis Service
Document Type
Conference Proceeding
Publication Date
9-7-2017
Abstract
With the development of next-generation sequencing (NGS), DNA/RNA sequencing has become cheaper and more efficient. Today, a whole human genome can be sequenced for under $1,000, providing opportunities for large-scale bioinformatic analysis on big datasets. However, most existing bioinformatic analysis tools are programmed for single-server computing platforms and are not suitable for processing such big datasets. As Hadoop MapReduce and Spark gain popularity as cluster-computing-based big data processing platforms, more and more bioinformatic applications are starting to explore cluster computing platforms for large-scale data analysis. In this paper, we present an in-depth experimental study on deploying Spark clusters for high-performance bioinformatic short sequence reconstruction. Our experimental results enable us to answer a number of challenging and yet most frequently asked questions regarding efficient management of bioinformatic data analysis services on Spark systems. Example questions include: How should a big dataset be split into multiple partitions, and how should data partitions and bioinformatic analysis tasks be distributed on a Spark cluster to carry out a high-performance distributed analysis job? What types of memory models are effective for bioinformatic data analysis services on a Spark cluster? Why do different bioinformatic data analysis operations exhibit different throughput performance on the same Spark cluster? We conjecture that this experimental study not only demonstrates the feasibility of high-performance bioinformatic data analysis on the Spark platform, but will also help bioinformatic application developers make more informed decisions on the design and configuration of Spark clusters and on managing and tuning the parameters of the Spark runtime system to enhance the performance of large-scale big data analytics.
Publication Source (Journal or Book title)
Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017
First Page
237
Last Page
244
Recommended Citation
Zhou, W., Liu, L., Pu, C., Zhu, T., Wang, Q., Xiang, W., & Yao, S. (2017). An Experimental Study of a Biosequence Big Data Analysis Service. Proceedings - 2017 IEEE 24th International Conference on Web Services, ICWS 2017, 237-244. https://doi.org/10.1109/ICWS.2017.38