Faculty Publications

I/O-efficient compressed text indexes: From theory to practice

Sheng Yuan Chiu, National Tsing Hua University
Wing Kai Hon, National Tsing Hua University
Rahul Shah, LSU College of Engineering
Jeffrey Scott Vitter, College of Engineering

Document Type

Conference Proceeding

Publication Date

6-1-2010

Abstract

Pattern matching on text data has been a fundamental field of Computer Science for nearly 40 years. Databases supporting full-text indexing functionality on text data are now widely used by biologists. In the theoretical literature, the most popular internal-memory index structures are suffix trees and suffix arrays, and the most popular external-memory index structure is the String B-tree. However, the practical applicability of these indexes has been limited, mainly because of their space consumption and I/O issues. These structures use far more space than the original text data (roughly 20 to 50 times as much) and are often disk-resident. Ferragina and Manzini (2005) and Grossi and Vitter (2005) gave the first compressed text indexes with efficient query times in the internal-memory model. Recently, Chien et al. (2008) presented a compact text index in external memory based on the concept of the Geometric Burrows-Wheeler Transform. They also presented lower bounds which suggest that it may be hard to obtain a good index structure in external memory. In this paper, we investigate this issue from a practical point of view. On the positive side, we show an external-memory text indexing structure (based on R-trees and KD-trees) that saves about an order of magnitude in space compared to the standard String B-tree. While saving space, these structures also maintain I/O efficiency comparable to that of the String B-tree. We also show various space vs. I/O efficiency trade-offs for our structures. © 2010 IEEE.

Publication Source (Journal or Book title)

Data Compression Conference Proceedings

First Page

426

Last Page

434

Recommended Citation

Chiu, S., Hon, W., Shah, R., & Vitter, J. (2010). I/O-efficient compressed text indexes: From theory to practice. Data Compression Conference Proceedings, 426-434. https://doi.org/10.1109/DCC.2010.45
