Document Type
Article
Publication Date
4-6-2020
Abstract
Let D be a collection of string documents of n characters in total. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P, k), can return the k documents of D most relevant to pattern P. The relevance of a document d for a pattern P is given by a predefined ranking function w(P, d). Linear space and optimal query time solutions already exist for this problem. In this paper we consider a novel problem, document selection, in which a query (P, k) aims to report the kth document most relevant to P (instead of reporting all top-k documents). We present a data structure using O(n log^ϵ n) space, for any constant ϵ > 0, answering selection queries in time O(log k / log log n), and a linear-space data structure answering queries in time O(log k), given the locus node of P in a (generalized) suffix tree of D. We also prove that it is unlikely that a succinct-space solution for this problem exists with polylogarithmic query time, and that O(log k / log log n) is indeed optimal within O(n polylog n) space for most text families. Finally, we present some additional space-time trade-offs exploring the extremes of those lower bounds.
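To make the difference between the two query types concrete, the following is a minimal brute-force sketch. It is not the data structure described in the abstract: it scans every document on each query, so it only illustrates what a top-k query and a selection query return. The names term_frequency, top_k and select_kth are hypothetical, and using term frequency (the number of occurrences of P in d) as the ranking function w(P, d) is an assumption, as is breaking ties by the lower document index and ranking documents with zero occurrences last.

```python
from typing import Callable, List


def term_frequency(pattern: str, document: str) -> int:
    """Assumed ranking function w(P, d): overlapping occurrences of P in d."""
    if not pattern:
        return 0
    count = 0
    start = document.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted.
        start = document.find(pattern, start + 1)
    return count


def rank_documents(
    docs: List[str],
    pattern: str,
    w: Callable[[str, str], int] = term_frequency,
) -> List[int]:
    """Return document indices sorted by decreasing w(P, d).

    Ties are broken by the lower index (an assumption of this sketch).
    """
    scored = [(w(pattern, d), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


def top_k(docs: List[str], pattern: str, k: int) -> List[int]:
    """Top-k query: the k documents most relevant to the pattern."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return rank_documents(docs, pattern)[:k]


def select_kth(docs: List[str], pattern: str, k: int) -> int:
    """Selection query: only the k-th most relevant document (1-based k)."""
    if k < 1 or k > len(docs):
        raise ValueError("k must satisfy 1 <= k <= number of documents")
    return rank_documents(docs, pattern)[k - 1]


if __name__ == "__main__":
    collection = ["abracadabra", "cabbage", "banana", "xyz"]
    print(top_k(collection, "a", 2))
    print(select_kth(collection, "a", 3))
```

This baseline spends time proportional to the total length n on every query, whereas the bounds quoted in the abstract depend only on k and n logarithmically once the locus node of P is known.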
Publication Source (Journal or Book title)
Theoretical Computer Science
First Page
149
Last Page
159
Recommended Citation
Munro, J., Navarro, G., Shah, R., & Thankachan, S. (2020). Ranked document selection. Theoretical Computer Science, 812, 149-159. https://doi.org/10.1016/j.tcs.2019.10.008