LSU Doctoral Dissertations

Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis

Identifier

etd-07092014-152032

Author

Praveenkumar Kondikoppa, Louisiana State University and Agricultural and Mechanical College

Degree

Doctor of Philosophy (PhD)

Department

Computer Science

Document Type

Dissertation

Abstract

Analyzing big data to gain better insights has been a focus of researchers in recent years. Traditional desktop computers and database management systems may not be suitable for efficient and timely analysis, because such analysis requires massively parallel processing. Distributed computing frameworks are being explored as a viable solution. For example, Google proposed MapReduce, which is becoming a de facto computing architecture for big data solutions. However, scheduling in MapReduce is coarse-grained and remains open to improvement. For the MapReduce scheduler configured over distributed clusters, we identify two issues: data locality disruption and random assignment of non-local map tasks. We propose a network-aware scheduler that extends the existing rack awareness. Tasks are scheduled in the order of node, rack, and any other rack within the same cluster to achieve cluster-level data locality. The random assignment of non-local map tasks is handled by enhancing the scheduler to consider network parameters, such as delay, bandwidth, and packet loss, between remote clusters. As part of big data analysis in computational biology, we consider two major data-intensive applications: indexing genome sequences and de novo assembly. Both applications deal with the massive amount of data generated by DNA sequencers. We developed a scalable algorithm that constructs sub-trees of a suffix tree in parallel to address the huge memory requirements of indexing the human genome. For de novo assembly, we propose the Parallel Giraph based Assembler (PGA) to address the challenges of assembling large genomes on commodity hardware. PGA uses the de Bruijn graph to represent the data generated by sequencers. Huge memory demands and performance expectations are addressed by developing parallel algorithms based on the distributed graph-processing framework Apache Giraph.
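The scheduling order described in the abstract (node, then rack, then any other rack in the same cluster, then remote clusters ranked by network parameters) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation. The names `Node`, `Link`, `link_cost`, `locality_tier`, and `pick_task` are hypothetical, and the cost formula that combines delay, bandwidth, and packet loss is an assumption chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    cluster: str


@dataclass(frozen=True)
class Link:
    delay_ms: float
    bandwidth_mbps: float
    loss_rate: float  # fraction of packets lost, 0.0 to 1.0


def link_cost(link):
    # Assumed scoring, not taken from the source: delay and loss raise
    # the cost, bandwidth lowers it.
    return link.delay_ms * (1.0 + 100.0 * link.loss_rate) / link.bandwidth_mbps


def locality_tier(worker, replica):
    # 0 = same node, 1 = same rack, 2 = other rack in the same cluster,
    # 3 = remote cluster.
    if worker.name == replica.name:
        return 0
    if worker.cluster == replica.cluster:
        return 1 if worker.rack == replica.rack else 2
    return 3


def pick_task(worker, tasks, links):
    """Pick the map task whose input data is 'closest' to the worker.

    tasks: dict mapping task id -> list of Nodes holding a replica.
    links: dict mapping (worker_cluster, replica_cluster) -> Link.
    """
    def best_placement(item):
        task_id, replicas = item
        scores = []
        for replica in replicas:
            tier = locality_tier(worker, replica)
            # Network cost only matters for data in a remote cluster.
            cost = 0.0
            if tier == 3:
                cost = link_cost(links[(worker.cluster, replica.cluster)])
            scores.append((tier, cost))
        # The task id breaks ties deterministically.
        return min(scores) + (task_id,)

    if not tasks:
        return None
    return min(tasks.items(), key=best_placement)[0]
```

With this ordering, a worker takes a remote-cluster task only when no task has data in its own cluster, and among remote tasks it prefers the one reached over the cheapest link instead of a random one.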

Date

2014

Document Availability at the Time of Submission

Secure the entire work for patent and/or proprietary purposes for a period of one year. Student has submitted appropriate documentation which states: During this period the copyright owner also agrees not to exercise her/his ownership rights, including public use in works, without prior authorization from LSU. At the end of the one year period, either we or LSU may request an automatic extension for one additional year. At the end of the one year secure period (or its extension, if such is requested), the work will be released for access worldwide.

Recommended Citation

Kondikoppa, Praveenkumar, "Performance Improvement of Distributed Computing Framework and Scientific Big Data Analysis" (2014). LSU Doctoral Dissertations. 3687.
https://repository.lsu.edu/gradschool_dissertations/3687

Committee Chair

Park, Seung-Jong

DOI

10.31390/gradschool_dissertations.3687

Included in

Computer Sciences Commons
