Faculty Publications

Inverted indexes for phrases and strings

Manish Patil, Louisiana State University
Sharma V. Thankachan, Louisiana State University
Rahul Shah, Louisiana State University
Wing Kai Hon, National Tsing Hua University
Jeffrey Scott Vitter, KU School of Engineering
Sabrina Chandrasekaran, Louisiana State University

Document Type

Conference Proceeding

Publication Date

1-1-2011

Abstract

Inverted indexes are the most fundamental and widely used data structures in information retrieval. For each unique word occurring in a document collection, the inverted index stores a list of the documents in which this word occurs. Compression techniques are often applied to further reduce the space requirement of these lists. However, the index has a shortcoming: only predefined pattern queries can be supported efficiently. For string documents, where word boundaries are undefined, indexing all the substrings of a given document makes the storage quickly become quadratic in the data size. Likewise, if we apply the same type of index to querying phrases or sequences of words, the inverted index ends up storing redundant information. In this paper, we show the first set of inverted indexes that work naturally for strings as well as phrase searching. The central idea is to exclude document d from the inverted list of a string P if every occurrence of P in d is subsumed by another string of which P is a prefix. With this we show that our space utilization is close to optimal. Techniques from succinct data structures are deployed to achieve compression while allowing fast access for both frequency-based and document-id-based retrieval. Compression and speed tradeoffs are evaluated for different variants of the proposed index. For phrase searching, we show that our indexes compare favorably against a typical inverted index deploying position-wise intersections. We also show efficient top-k retrieval under relevance metrics such as frequency and tf-idf.
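The exclusion rule in the abstract can be illustrated with a small brute-force sketch. This is only one reading of the rule, not the paper's data structure: it assumes that "subsumed" means every occurrence of P in d is followed by the same character c, so that the longer string P+c already accounts for d. The function names `build_pruned_lists` and `query`, the `max_len` cap on indexed substring length, and the sample documents are all hypothetical and chosen for illustration.

```python
from collections import defaultdict


def build_pruned_lists(docs, max_len):
    """Toy pruned inverted lists over substrings of length <= max_len.

    Assumption (not from the paper): document d is left out of the list of P
    when every occurrence of P in d is followed by one and the same character,
    because the list of the extended string then covers d.
    """
    lists = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # For each substring P, collect the characters that follow it in this
        # document. None marks an occurrence that cannot be extended (end of
        # text, or P already has the maximum indexed length).
        followers = defaultdict(set)
        for i in range(len(text)):
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                p = text[i:j]
                can_extend = j < len(text) and len(p) < max_len
                followers[p].add(text[j] if can_extend else None)
        for p, nexts in followers.items():
            # Keep d unless a single one-character extension subsumes P in d.
            if len(nexts) > 1 or None in nexts:
                lists[p].add(doc_id)
    return lists


def query(lists, pattern):
    """Documents containing pattern (len(pattern) <= max_len): the union of
    the lists of every indexed string that has pattern as a prefix."""
    result = set()
    for p, ids in lists.items():
        if p.startswith(pattern):
            result |= ids
    return result


docs = ["banana", "bandana", "cab"]
lists = build_pruned_lists(docs, max_len=4)
print(sorted(lists["an"]))        # [1]: "banana" is subsumed by "ana"
print(sorted(query(lists, "an")))  # [0, 1]: recovered through the longer string
```

In this toy version the query scans every key, which is far from efficient; it only shows that dropping subsumed documents loses no answers, since each dropped document reappears in the list of some longer string with P as a prefix.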

Publication Source (Journal or Book title)

SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval

First Page

555

Last Page

564

Recommended Citation

Patil, M., Thankachan, S., Shah, R., Hon, W., Vitter, J., & Chandrasekaran, S. (2011). Inverted indexes for phrases and strings. SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, 555-564. https://doi.org/10.1145/2009916.2009992
