Compressed property suffix trees

Document Type

Conference Proceeding

Publication Date

5-12-2011

Abstract

Property matching is a biologically motivated problem where the task is to find those occurrences of an online pattern P in a text string T (of length n) such that the matched part of T satisfies some conceptual property. The property of a string is a set \pi of (possibly overlapping) intervals \{(s_1, f_1), (s_2, f_2), \ldots\} corresponding to parts of T, and an occurrence of a pattern P at T[i..(i+|P|-1)] is a valid output under the property \pi only if T[i..(i+|P|-1)] is completely contained in some interval (s_j, f_j) \in \pi. This problem can be solved in time linear in the size of the text. Amir et al. (2008) introduced the indexing version of this problem, where they preprocess the text in O(n\log\sigma+n\log\log n) time and maintain an O(n\log n)-bit index, named \emph{Property Suffix Tree} (PST), where \sigma denotes the alphabet size. PST can perform property matching in optimal O(|P|\log\sigma+occ_{\pi}) time, where occ_{\pi} is the number of occurrences of P in T that satisfy the property. Later, Iliopoulos and Rahman (2008) proposed an alternative index which can be constructed in linear time. Recently, Kopelowitz (2010) considered the dynamic version of this problem, where intervals can be added or deleted. However, all of these indexes require O(n\log n) bits of space, which can be much more than the size of the text (n\log\sigma bits). In this paper, we propose the first index for property matching which takes space close to the entropy-compressed space requirement of the text. Our compressed index takes |CSA|+n(2+\epsilon+o(1)) bits of space and can perform query answering in O(t(|P|)+\frac{1}{\epsilon}occ_{\pi} t_{SA}) time, where |CSA| is the size of the compressed suffix array (CSA), t(|P|) and t_{SA} are the time for searching a pattern of length |P| and the time for computing a suffix array value using CSA, respectively, and \epsilon is a constant. 
We also introduce a dynamic index, taking |CSA|+n(2+\epsilon+o(1))+O(|\pi|\log n) bits of space, which can perform query answering in O(t(|P|)+\frac{1}{\epsilon}occ_{\pi}(\log n/\log\log n+t_{SA})\log n) time and can update (insert or delete) an interval (s,f) in O((f-s+1)(\log n+\log|\pi|ε{tSA) time. © 2011 IEEE.
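
The containment rule in the problem definition can be illustrated with a small non-indexed sketch. This is not the Property Suffix Tree or the compressed index from the paper; it is a plain scan written only to show which occurrences count as valid. The function names, the 0-based inclusive interval convention, and the sample strings are assumptions made for this illustration. The sketch relies on one observation: an occurrence starting at i with length m lies inside some interval exactly when the largest f among intervals with s <= i is at least i+m-1.

```python
def furthest_end(n, intervals):
    """Return end[i] = largest f over intervals (s, f) with s <= i, or -1 if none.

    Intervals are assumed to be 0-based and inclusive, with 0 <= s <= f < n.
    """
    end = [-1] * n
    for s, f in intervals:
        end[s] = max(end[s], f)
    # Prefix maximum: every interval starting at or before i is taken into account.
    for i in range(1, n):
        end[i] = max(end[i], end[i - 1])
    return end


def property_match(text, pattern, intervals):
    """Return start positions of pattern in text that lie fully inside some interval."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    end = furthest_end(n, intervals)
    hits = []
    i = text.find(pattern)
    while i != -1:
        # The occurrence covers text[i .. i+m-1]; it is valid only if some
        # interval that starts at or before i also ends at or after i+m-1.
        if end[i] >= i + m - 1:
            hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


if __name__ == "__main__":
    # "abra" occurs at positions 0 and 7; only the first lies inside an interval.
    print(property_match("abracadabra", "abra", [(0, 4), (6, 9)]))
```

In the sample call, the occurrence at position 7 spans positions 7 to 10, which is not contained in (6, 9), so it is not reported.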

Publication Source (Journal or Book title)

Data Compression Conference Proceedings

First Page

123

Last Page

132
