Fully Functional Parameterized Suffix Trees in Compact Space

Document Type

Conference Proceeding

Publication Date

7-1-2022

Abstract

Two equal-length strings are a parameterized match (p-match) iff there exists a one-to-one function that renames the symbols in one string to those in the other. The Parameterized Suffix Tree (PST) [Baker, STOC '93] is a fundamental data structure that handles various string matching problems under this setting. The PST of a text T[1, n] over an alphabet Σ of size σ takes O(n log n) bits of space. It can report any entry in the (parameterized) (i) suffix array, (ii) inverse suffix array, and (iii) longest common prefix (LCP) array in O(1) time. Given any pattern P as a query, a position i in T is an occurrence iff T[i, i + |P| − 1] and P are a p-match. The PST can count the number of occurrences of P in T in time O(|P| log σ) and then report each occurrence in time proportional to that of accessing a suffix array entry. An important question is, can we obtain a compressed version of PST that takes space close to the text's size of n log σ bits and still support all three functionalities mentioned earlier? In SODA '17, Ganguly et al. answered this question partially by presenting an O(n log σ)-bit index that can support (parameterized) suffix array and inverse suffix array operations in O(log n) time. However, the compression of the (parameterized) LCP array and the possibility of faster suffix array and inverse suffix array queries in compact space were left open. In this work, we obtain a compact representation of the (parameterized) LCP array. With this result, in conjunction with three new (parameterized) suffix array representations, we obtain the first set of PST representations in o(n log n) bits (when log σ = o(log n)) as follows. Here ε > 0 is an arbitrarily small constant. (1) Space O(n log σ) bits and query time O(log^ε_σ n); (2) space O(n log σ log log_σ n) bits and query time O(log log_σ n); and (3) space O(n log σ log^ε_σ n) bits and query time O(1).
The first trade-off is an improvement over Ganguly et al.'s result, whereas our third trade-off matches the optimal time performance of Baker's PST while squeezing the space by a factor of roughly log_σ n. We highlight that our trade-offs match the space-and-time bounds of the best-known compressed text indexes for exact pattern matching and further improvement is highly unlikely.
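
As an illustration of the p-match definition used in the abstract, the following is a minimal Python sketch, not the paper's data structure. It checks whether two strings are related by a one-to-one renaming and finds occurrences by brute force in O(n|P|) time, whereas the PST answers such queries far faster. The function names `is_p_match` and `p_occurrences` are hypothetical, and the sketch assumes every symbol may be renamed, as in the definition above.

```python
def is_p_match(s: str, t: str) -> bool:
    """Return True iff a one-to-one renaming of symbols maps s onto t."""
    if len(s) != len(t):
        return False
    forward = {}   # symbol of s -> symbol of t
    backward = {}  # symbol of t -> symbol of s (enforces one-to-one)
    for a, b in zip(s, t):
        # setdefault records the pairing on first sight and returns the
        # existing pairing otherwise, so any mismatch is a conflict.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


def p_occurrences(text: str, pattern: str) -> list[int]:
    """Brute-force list of 1-based positions i where
    T[i, i + |P| - 1] and P are a p-match (assumes a non-empty pattern)."""
    m = len(pattern)
    return [
        i + 1  # report 1-based positions, matching the T[1, n] convention
        for i in range(len(text) - m + 1)
        if is_p_match(text[i:i + m], pattern)
    ]


if __name__ == "__main__":
    print(is_p_match("abab", "xyxy"))
    print(is_p_match("ab", "xx"))
    print(p_occurrences("aabbab", "xxy"))
```

The second dictionary is what enforces the one-to-one requirement: without it, "ab" would wrongly p-match "xx".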

Publication Source (Journal or Book title)

Leibniz International Proceedings in Informatics, LIPIcs
