Document Type

Article

Publication Date

3-9-2023

Abstract

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T_1, T_2, ..., T_d} of d strings (called documents) of total length n into a data structure such that, for any given query (P, k), where P is a string (called the pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of the k documents most relevant to P can be reported, ideally in sorted order of relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. Navarro and Nekrich later improved the query time to O(p + k) [SODA 2012] and further to O(p / log_σ n + k) [SIAM Journal on Computing 2017], where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first takes O(n) space and answers queries in O(p/B + log_B n + k/B + log*(n/B)) I/Os, where B is the block size. The second takes O(n log*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
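
To make the query semantics concrete, the following is a minimal brute-force sketch of the top-k document retrieval problem as defined above. It is not any of the data structures from the article and has none of their space or I/O guarantees: it scans every document on each query. The abstract does not fix a relevance measure, so the sketch assumes relevance is the number of (possibly overlapping) occurrences of P in a document. The function names `count_occurrences` and `top_k_documents` are hypothetical and chosen only for illustration.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents, pattern, k):
    """Return up to k (doc_id, score) pairs, most relevant first.

    Assumption: relevance is term frequency, which the abstract does not
    specify. Document identifiers are 1-based, matching T_1, ..., T_d.
    Ties are broken by smaller identifier. Documents that do not contain
    the pattern are never reported.
    """
    if len(pattern) < 1:
        raise ValueError("pattern must have length p >= 1")
    if not 1 <= k <= len(documents):
        raise ValueError("k must lie in [1, d]")

    scored = []
    for doc_id, text in enumerate(documents, start=1):
        score = count_occurrences(text, pattern)
        if score > 0:
            scored.append((doc_id, score))

    # Sorted variant of the problem: order by decreasing score.
    return heapq.nsmallest(k, scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = ["abracadabra", "cadabra", "abba", "xyz"]
    print(top_k_documents(docs, "ab", 2))
```

This baseline spends time proportional to the total collection length n on every query, which is the cost the indexed structures in the article are designed to avoid.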

Publication Source (Journal or Book title)

ACM Transactions on Algorithms
