Document Type
Article
Publication Date
3-9-2023
Abstract
The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T_1, T_2, ..., T_d} of d strings (called documents) of total length n into a data structure such that, for any query (P, k), where P is a string (called the pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of the k documents most relevant to P can be reported, ideally sorted by relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first takes O(n) space and answers queries in O(p/B + log_B n + k/B + log*(n/B)) I/Os, where B is the block size. The second takes O(n log*(n/B)) space and answers queries in the optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported unsorted, that is, not in order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
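To make the query semantics concrete, the following is a minimal brute-force sketch of a sorted top-k query, not the data structures described in the abstract. The abstract does not specify the relevance measure, so the sketch assumes term frequency (the number of occurrences of P in a document); the function names `count_occurrences` and `top_k_documents` are hypothetical. It scans every document on each query, so it takes time proportional to the collection size rather than the bounds quoted above, and it ignores the external memory model entirely.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: list[str], pattern: str, k: int) -> list[int]:
    """Return up to k document identifiers (1-based), most relevant first.

    Assumption: relevance is term frequency. Documents that do not
    contain the pattern are never reported. Ties are broken by the
    smaller identifier.
    """
    if len(pattern) < 1:
        raise ValueError("pattern must have length p >= 1")
    if not 1 <= k <= len(documents):
        raise ValueError("k must lie in [1, d]")

    scored = []
    for doc_id, doc in enumerate(documents, start=1):
        freq = count_occurrences(doc, pattern)
        if freq > 0:
            scored.append((freq, doc_id))

    # Select the k best by descending frequency, then ascending identifier.
    best = heapq.nsmallest(k, scored, key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in best]


if __name__ == "__main__":
    docs = ["banana", "ananas", "bandana", "cabana"]
    print(top_k_documents(docs, "ana", 3))
```

Under the term-frequency assumption, a query such as ("ana", 3) on the four example strings ranks the two documents with two occurrences ahead of those with one. An index meeting the bounds in the abstract would answer the same query without scanning the collection.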
Publication Source (Journal or Book title)
ACM Transactions on Algorithms
Recommended Citation
Shah, R., Sheng, C., Thankachan, S., & Vitter, J. (2023). Ranked Document Retrieval in External Memory. ACM Transactions on Algorithms, 19(1). https://doi.org/10.1145/3559763