Top-k Term-Proximity in Succinct Space

Document Type

Article

Publication Date

6-1-2017

Abstract

Let D= { T1, T2, … , TD} be a collection of D string documents of n characters in total, that are drawn from an alphabet set Σ= [ σ]. The top-k document retrieval problem is to preprocess D into a data structure that, given a query (P[ 1 … p] , k) , can return the k documents of D most relevant to the pattern P. The relevance is captured using a predefined ranking function, which depends on the set of occurrences of P in Td. For example, it can be the term frequency (i.e., the number of occurrences of P in Td), or it can be the term proximity (i.e., the distance between the closest pair of occurrences of P in Td), or a pattern-independent importance score of Td such as PageRank. Linear space and optimal query time solutions already exist for the general top-k document retrieval problem. Compressed and compact space solutions are also known, but only for a few ranking functions such as term frequency and importance. However, space efficient data structures for term proximity based retrieval have been evasive. In this paper we present the first sub-linear space data structure for this relevance function, which uses only o(n) bits on top of any compressed suffix array of D and solves queries in O((p+k)polylogn) time. We also show that scores that consist of a weighted combination of term proximity, term frequency, and document importance, can be handled using twice the space required to represent the text collection.

Publication Source (Journal or Book title)

Algorithmica

First Page

379

Last Page

393

This document is currently not available here.

Share

COinS