Document Type

Article

Publication Date

4-6-2020

Abstract

Let $D$ be a collection of string documents of $n$ characters in total. The top-$k$ document retrieval problem is to preprocess $D$ into a data structure that, given a query $(P,k)$, can return the $k$ documents of $D$ most relevant to pattern $P$. The relevance of a document $d$ for a pattern $P$ is given by a predefined ranking function $w(P,d)$. Linear-space solutions with optimal query time already exist for this problem. In this paper we consider a novel problem, document selection, in which a query $(P,k)$ aims to report the $k$th document most relevant to $P$ (instead of reporting all top-$k$ documents). We present a data structure using $O(n \log^{\epsilon} n)$ space, for any constant $\epsilon > 0$, answering selection queries in time $O(\log k / \log\log n)$, and a linear-space data structure answering queries in time $O(\log k)$, given the locus node of $P$ in a (generalized) suffix tree of $D$. We also prove that it is unlikely that a succinct-space solution for this problem exists with poly-logarithmic query time, and that $O(\log k / \log\log n)$ is indeed optimal within $O(n \,\mathrm{polylog}\, n)$ space for most text families. Finally, we present some additional space-time trade-offs exploring the extremes of those lower bounds.
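
To make the difference between the two query types concrete, the following is a naive baseline sketch, not the data structure from the paper. It assumes term frequency (the number of occurrences of P in d) as the ranking function w(P,d), excludes documents with zero relevance, and breaks ties by document index; all three choices, and every function name, are illustrative assumptions. It scans every document per query, so it shows only what the queries return, not how to answer them efficiently.

```python
from typing import Callable, List, Optional


def term_frequency(pattern: str, doc: str) -> int:
    """Assumed ranking function w(P, d): overlapping occurrences of pattern in doc."""
    if not pattern:
        return 0
    count = 0
    start = doc.find(pattern)
    while start != -1:
        count += 1
        start = doc.find(pattern, start + 1)  # step by 1 to allow overlaps
    return count


def rank_documents(
    docs: List[str],
    pattern: str,
    w: Callable[[str, str], int] = term_frequency,
) -> List[int]:
    """Return indices of documents with w > 0, most relevant first.

    Ties are broken by smaller document index (an assumption of this sketch).
    """
    scored = [(w(pattern, d), i) for i, d in enumerate(docs)]
    scored = [(s, i) for s, i in scored if s > 0]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored]


def top_k(docs: List[str], pattern: str, k: int) -> List[int]:
    """Top-k retrieval: report all of the k most relevant documents."""
    return rank_documents(docs, pattern)[:k]


def select_kth(docs: List[str], pattern: str, k: int) -> Optional[int]:
    """Document selection: report only the k-th most relevant document (1-based)."""
    ranked = rank_documents(docs, pattern)
    return ranked[k - 1] if 1 <= k <= len(ranked) else None


if __name__ == "__main__":
    docs = ["abab", "ab", "ababab", "xyz"]
    print(top_k(docs, "ab", 2))       # indices of the two most relevant documents
    print(select_kth(docs, "ab", 2))  # index of the second most relevant document
```

A selection query returns a single document, so its output size does not grow with k; this is why query times below the O(k) cost of listing all top-k documents are meaningful for this problem.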

Publication Source (Journal or Book title)

Theoretical Computer Science

First Page

149

Last Page

159
