Efficient Substring Discovery Using Suffix, LCP Array and Algorithm-Architecture Interaction
Doctor of Philosophy (PhD)
Preprocessing is unavoidable when extracting information from large databases such as gene or protein sequences. Pattern discovery becomes very time efficient when the database is preprocessed into the form of a suffix array. Because of the inherent organization of data in the suffix array and its secondary data structure, the longest common prefix (LCP) array (Manber and Myers 1990), only a limited portion of the database is accessed during a search, so a large amount of information is obtained in very little time, depending on the size of the database. Unlike exact pattern matching, here we preprocess the database instead of the pattern. We found the suffix and LCP arrays to be an excellent tool for computing N-grams (substrings) in various dimensions. Over the past couple of decades there has been significant research on the construction of suffix and LCP arrays. By comparison, research on properly utilizing these promising data structures to retrieve substring information from various perspectives has received almost no attention. Our main focus in this work was to develop a number of algorithms that compute the present and missing N-grams of a text in linear time and present them non-redundantly for large databases. Finding the present and missing N-grams in large genome sequences, and representing them non-redundantly in a time-efficient way, may lead to new discoveries in biology. We implemented all of our algorithms, applied them to various genome and proteome sequences, and found interesting results. The algorithms were also tested for performance and other hardware parameters on various platforms in order to suggest an appropriate architecture for this kind of application.
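The idea of reading N-grams off a suffix array and its LCP array can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the function names are hypothetical, the suffix array is built by naive sorting rather than a linear-time construction, and the missing N-grams are found by brute-force enumeration over an assumed alphabet, which is exponential in N. Only the LCP computation (the well-known Kasai et al. method) and the single pass that groups suffixes are linear in the text length.

```python
from itertools import product


def build_suffix_array(text):
    # Naive construction by sorting suffixes (illustration only, not linear time).
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    # Kasai-style LCP construction: lcp[i] is the length of the common prefix
    # of the suffixes starting at sa[i-1] and sa[i]; lcp[0] is 0.
    n = len(text)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def present_ngrams(text, n):
    # One pass over the suffix array: adjacent suffixes with lcp >= n share
    # the same N-gram, so each distinct N-gram is reported once with its count.
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    counts = {}
    current = None
    for i, start in enumerate(sa):
        if start + n > len(text):
            continue  # suffix shorter than n contributes no N-gram
        if lcp[i] >= n:
            counts[current] += 1
        else:
            current = text[start:start + n]
            counts[current] = 1
    return counts


def missing_ngrams(text, n, alphabet):
    # Brute force over all |alphabet|**n candidates (assumption: small n).
    present = present_ngrams(text, n)
    candidates = ("".join(p) for p in product(sorted(alphabet), repeat=n))
    return [g for g in candidates if g not in present]


if __name__ == "__main__":
    print(present_ngrams("banana", 2))
    print(missing_ngrams("banana", 2, "abn"))
```

Because suffixes that share an N-gram are adjacent in the suffix array, each distinct N-gram is reported exactly once, which is one simple form of the non-redundant presentation described above.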
Document Availability at the Time of Submission
Secure the entire work for patent and/or proprietary purposes for a period of one year. Student has submitted appropriate documentation which states: During this period the copyright owner also agrees not to exercise her/his ownership rights, including public use in works, without prior authorization from LSU. At the end of the one year period, either we or LSU may request an automatic extension for one additional year. At the end of the one year secure period (or its extension, if such is requested), the work will be released for access worldwide.
Poddar, Anindya, "Efficient Substring Discovery Using Suffix, LCP Array and Algorithm-Architecture Interaction" (2011). LSU Doctoral Dissertations. 490.
Kraft, Donald H.