String retrieval for multi-pattern queries
Document Type
Conference Proceeding
Publication Date
11-24-2010
Abstract
Given a collection $D$ of string documents $\{d_1, d_2, \ldots, d_{|D|}\}$ of total length $n$, which may be preprocessed, a fundamental task is to retrieve the documents most relevant to a given query. The query consists of a set of $m$ patterns $\{P_1, P_2, \ldots, P_m\}$. To measure the relevance of a document with respect to the query patterns, we may define a score, such as the number of occurrences of these patterns in the document, or the proximity of the given patterns within the document. To control the size of the output, we may also specify a threshold (or a parameter $K$), so that our task is to report all the documents that match the query with a score above the threshold (or, respectively, the $K$ documents with the highest scores). When the documents are strings (without word boundaries), the traditional inverted-index-based solutions may not be applicable. The single-pattern retrieval case has been well solved by [14,9]. When it comes to two or more patterns, the only non-trivial solution for proximity search and common document listing was given by [14], which took $\tilde{O}(n^{3/2})$ space. In this paper, we give the first linear-space (and partly succinct) data structures, which can answer multi-pattern queries in $O(\sum |P_i|) + \tilde{O}(t^{1/m} n^{1-1/m})$ time, where $t$ is the number of output occurrences. In the particular case of two patterns, we achieve the bound of $O(|P_1| + |P_2| + \sqrt{nt} \log^2 n)$. We also show space-time trade-offs for our data structures. Our approach is based on a novel data structure called the weight-balanced wavelet tree, which may be of independent interest. © 2010 Springer-Verlag.
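To make the query model in the abstract concrete, the following is a minimal brute-force sketch of the problem statement only; it is not the paper's data structure and uses no preprocessing or wavelet tree. It assumes one possible score (the total number of pattern occurrences, counted only when every pattern occurs in the document), and all function names are hypothetical, introduced here for illustration.

```python
from heapq import nlargest


def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def score(doc, patterns):
    """Assumed score: total occurrences of all patterns, or 0 if any
    pattern is absent (the document does not match the query)."""
    counts = [count_occurrences(doc, p) for p in patterns]
    return sum(counts) if all(counts) else 0


def report_above_threshold(docs, patterns, threshold):
    """Indices of documents whose score exceeds the threshold."""
    return [i for i, d in enumerate(docs) if score(d, patterns) > threshold]


def report_top_k(docs, patterns, k):
    """Indices of the k highest-scoring matching documents.
    Ties are broken in favour of the smaller document index."""
    scored = [(score(d, patterns), i) for i, d in enumerate(docs)]
    matching = [(s, i) for s, i in scored if s > 0]
    best = nlargest(k, matching, key=lambda pair: (pair[0], -pair[1]))
    return [i for _, i in best]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "abcabc", "cab"]
    patterns = ["ab", "ca"]
    print(report_above_threshold(docs, patterns, 2))
    print(report_top_k(docs, patterns, 3))
```

This baseline scans every document for every pattern on each query, so its cost grows with the total length $n$ regardless of the output size; the bounds quoted in the abstract are what an index would aim to improve on.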
Publication Source (Journal or Book title)
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
First Page
55
Last Page
66
Recommended Citation
Hon, W., Shah, R., Thankachan, S., & Vitter, J. (2010). String retrieval for multi-pattern queries. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 6393 LNCS, 55-66. https://doi.org/10.1007/978-3-642-16321-0_6