Faculty Publications

Top-k Term-Proximity in Succinct Space

J. Ian Munro, University of Waterloo
Gonzalo Navarro, Universidad de Chile
Jesper Sindahl Nielsen, Aarhus Universitet
Rahul Shah, LSU College of Engineering
Sharma V. Thankachan, Georgia Institute of Technology

Document Type

Article

Publication Date

6-1-2017

Abstract

Let $\mathcal{D} = \{T_1, T_2, \ldots, T_D\}$ be a collection of $D$ string documents of $n$ characters in total, drawn from an alphabet set $\Sigma = [\sigma]$. The top-$k$ document retrieval problem is to preprocess $\mathcal{D}$ into a data structure that, given a query $(P[1 \ldots p], k)$, can return the $k$ documents of $\mathcal{D}$ most relevant to the pattern $P$. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of $P$ in $T_d$. For example, it can be the term frequency (i.e., the number of occurrences of $P$ in $T_d$), the term proximity (i.e., the distance between the closest pair of occurrences of $P$ in $T_d$), or a pattern-independent importance score of $T_d$ such as PageRank. Linear space and optimal query time solutions already exist for the general top-$k$ document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space-efficient data structures for term-proximity-based retrieval have been elusive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only $o(n)$ bits on top of any compressed suffix array of $\mathcal{D}$ and solves queries in $O((p+k)\,\mathrm{polylog}\, n)$ time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance can be handled using twice the space required to represent the text collection.
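To make the ranking functions in the abstract concrete, here is a minimal brute-force sketch of term frequency, term proximity, and top-k retrieval by proximity. It is not the data structure from the paper: it scans every document in full and uses no compressed suffix array. It only illustrates what a query is expected to return. The function names (`occurrences`, `term_frequency`, `term_proximity`, `top_k_by_proximity`) are hypothetical. Counting overlapping occurrences, measuring distance between starting positions, excluding documents with fewer than two occurrences, and breaking ties by document index are all assumptions made for this illustration.

```python
import heapq
from typing import List, Tuple


def occurrences(text: str, pattern: str) -> List[int]:
    """Return the start positions of all (possibly overlapping) occurrences."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def term_frequency(text: str, pattern: str) -> int:
    """Number of occurrences of pattern in text."""
    return len(occurrences(text, pattern))


def term_proximity(text: str, pattern: str) -> float:
    """Distance between the closest pair of occurrences of pattern in text.

    Assumption for this sketch: a document with fewer than two
    occurrences has infinite proximity.
    """
    occ = occurrences(text, pattern)
    if len(occ) < 2:
        return float("inf")
    # Positions are sorted, so the closest pair is always adjacent.
    return min(b - a for a, b in zip(occ, occ[1:]))


def top_k_by_proximity(docs: List[str], pattern: str, k: int) -> List[Tuple[int, float]]:
    """Return up to k (document index, proximity) pairs, smallest distance first.

    Naive baseline: O(total text length) work per query.
    Ties are broken by document index (an assumption of this sketch).
    """
    scored = []
    for index, text in enumerate(docs):
        score = term_proximity(text, pattern)
        if score != float("inf"):
            scored.append((score, index))
    return [(index, score) for score, index in heapq.nsmallest(k, scored)]
```

Calling `top_k_by_proximity(docs, "a", 2)` on a small list of strings returns the two documents whose closest pair of occurrences of "a" are nearest to each other. Each query here scans the whole collection, which is the cost the succinct structure described above is designed to avoid.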

Publication Source (Journal or Book title)

Algorithmica

First Page

379

Last Page

393

Recommended Citation

Munro, J., Navarro, G., Nielsen, J., Shah, R., & Thankachan, S. (2017). Top-k Term-Proximity in Succinct Space. Algorithmica, 78 (2), 379-393. https://doi.org/10.1007/s00453-016-0167-2

