Faculty Publications

Efficient index for retrieving top-k most frequent documents

Wing Kai Hon, National Tsing Hua University
Rahul Shah, LSU College of Engineering
Shih Bin Wu, National Tsing Hua University

Document Type

Conference Proceeding

Publication Date

11-9-2009

Abstract

In the document retrieval problem [9], we are given a collection of documents (strings) of total length D in advance, and the goal is to build an index for these documents such that, for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension of this problem, which we call top-k frequent document retrieval: instead of listing all documents containing P, the goal is to identify the k documents with the most occurrences of P. This problem forms a basis for search engine tasks that retrieve documents ranked by the TF-IDF metric. A related problem was studied in [9], where the emphasis was on retrieving all documents in which the number of occurrences of P exceeds some frequency threshold f. From an information retrieval point of view, however, it is hard for a user to specify such a threshold f and to anticipate how many documents will be returned. We develop additional building blocks that help overcome this limitation. These are used to derive an efficient index for the top-k frequent document retrieval problem, answering queries in O(|P| + log D log log D + k) time and taking O(D log D) space. Our approach is based on a novel use of the suffix tree, called the induced generalized suffix tree (IGST). © 2009 Springer.
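
To make the problem statement concrete, the following is a minimal brute-force sketch of what a top-k frequent document query returns. It is not the IGST index described in the abstract and has none of its time or space guarantees: it scans every document on each query. The function names `count_occurrences` and `top_k_frequent_documents`, the choice to count overlapping occurrences, and the tie-breaking rule (smaller document id first) are assumptions made for illustration only.

```python
import heapq


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, overlapping ones included."""
    if not pattern:
        return 0
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are counted too.
        start = text.find(pattern, start + 1)
    return count


def top_k_frequent_documents(documents, pattern, k):
    """Return up to k (doc_id, count) pairs with the highest counts.

    Naive baseline: scans all documents per query. Documents that do not
    contain the pattern are never reported. Ties are broken by smaller
    doc_id (an assumption of this sketch).
    """
    scored = []
    for doc_id, doc in enumerate(documents):
        c = count_occurrences(doc, pattern)
        if c > 0:
            scored.append((c, doc_id))
    best = heapq.nsmallest(k, scored, key=lambda t: (-t[0], t[1]))
    return [(doc_id, c) for c, doc_id in best]
```

Unlike the threshold formulation, the caller here fixes the number of results k directly rather than guessing a frequency threshold f, which is the usability point the abstract raises.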

Publication Source (Journal or Book title)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

First Page

182

Last Page

193

Recommended Citation

Hon, W., Shah, R., & Wu, S. (2009). Efficient index for retrieving top-k most frequent documents. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5721 LNCS, 182-193. https://doi.org/10.1007/978-3-642-03784-9_18

This document is currently not available here.
