Document Type
Conference Proceeding
Publication Date
5-12-2011
Abstract
Given a set $D$ of $d$ patterns of total length $n$, the dictionary matching problem is to index $D$ such that for any query text $T$, we can efficiently locate the occurrences of any pattern within $T$. This problem can be solved in optimal $O(|T| + occ)$ time by the classical AC automaton (Aho and Corasick, 1975), where $occ$ denotes the number of occurrences. The space requirement is $O(n)$ words. In the approximate dictionary matching problem with one error, we consider a substring $T[i..j]$ an occurrence of a pattern $P$ whenever the edit distance between $T[i..j]$ and $P$ is at most one. For this problem, the best known indexes are by Cole et al. (2004), which requires $O(n + d \log d)$ words of space and reports all occurrences in $O(|T| \log d \log\log d + occ)$ time, and by Ferragina et al. (1999), which requires $O(n^{1+\epsilon})$ words of space and reports all occurrences in $O(|T| \log\log n + occ)$ time. Recently, there have been successes in compressing the dictionary matching index while keeping the query time optimal (Belazzougui, 2010; Hon et al., 2010). However, a compressed index for the approximate dictionary matching problem remains an open problem. In this paper, we propose the first such index, which requires an optimal $nH_k + O(n) + o(n \log \sigma)$ bits of index space, where $H_k$ denotes the $k$th-order empirical entropy of $D$, and $\sigma$ is the size of the alphabet from which all the characters in $D$ and $T$ are drawn. The query time of our index is $O(\sigma |T| \log^3 n \log\log n + occ)$. © 2011 IEEE.
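To make the one-error occurrence definition concrete, the following is a minimal brute-force sketch in Python. It is not the index proposed in the paper, nor any of the cited indexes: it scans the text directly, using no preprocessing and no compression. The function names `within_one_edit` and `naive_one_error_matches` are hypothetical, and substrings are reported as 0-indexed half-open ranges, which is an assumption made here for convenience.

```python
def within_one_edit(a, b):
    """Return True if the edit distance between a and b is at most one
    (one substitution, one insertion, or one deletion)."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure a is the shorter (or equal-length) string
    # Skip the longest common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        # Allow at most one substitution at position i.
        return a[i + 1:] == b[i + 1:]
    # Lengths differ by one: allow one extra character in b at position i.
    return a[i:] == b[i + 1:]


def naive_one_error_matches(patterns, text):
    """Report every (start, end, pattern_index) such that text[start:end]
    is within edit distance one of patterns[pattern_index].
    Illustrative baseline only; empty substrings are not reported."""
    hits = []
    for k, p in enumerate(patterns):
        # A substring within one edit of p has length |p|-1, |p|, or |p|+1.
        for length in (len(p) - 1, len(p), len(p) + 1):
            if length < 1:
                continue
            for start in range(len(text) - length + 1):
                if within_one_edit(text[start:start + length], p):
                    hits.append((start, start + length, k))
    return sorted(hits)
```

This baseline performs on the order of $n|T|$ character comparisons, since each pattern is compared against every text position. The indexes discussed in the abstract exist to avoid exactly this dependence on the dictionary size at query time.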
Publication Source (Journal or Book title)
Data Compression Conference Proceedings
First Page
113
Last Page
122
Recommended Citation
Hon, W., Ku, T., Shah, R., Thankachan, S., & Vitter, J. (2011). Compressed dictionary matching with one error. Data Compression Conference Proceedings, 113-122. https://doi.org/10.1109/DCC.2011.18