Similarity joins for uncertain strings
Document Type
Conference Proceeding
Publication Date
1-1-2014
Abstract
A string similarity join finds all similar string pairs between two input string collections. It is an essential operation in many applications, such as data integration and cleaning, and has been extensively studied for deterministic strings. Increasingly, applications have to deal with imprecise strings or strings that contain fuzzy information. This work presents the first solution for answering similarity join queries over uncertain strings that implements possible-world semantics, using the edit distance as the measure of similarity. Given two collections of uncertain strings and input (κ, τ), our task is to find string pairs (R, S), with R from the first collection and S from the second, such that Pr(ed(R, S) ≤ κ) > τ, i.e., the probability of the edit distance between R and S being at most κ is more than the probability threshold τ. We can address the join problem by obtaining, for each string R in the first collection, all strings in the second collection that are similar to it. However, existing solutions for answering such similarity search queries on uncertain string databases only support a deterministic string as input. Exploiting these solutions would require exponentially many possible worlds of R to be considered, which is not only ineffective but also prohibitively expensive. We propose various filtering techniques that give upper and/or lower bounds on Pr(ed(R, S) ≤ κ) without instantiating possible worlds for either of the strings. We then incorporate these techniques into an indexing scheme and significantly reduce the filtering overhead. Further, we alleviate the verification cost of a string pair that survives pruning by using a trie structure, which allows us to overlap the verification cost of exponentially many possible instances of the candidate string pair. Finally, we evaluate the effectiveness of the proposed approach by thorough practical experimentation. © 2014 ACM.
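To make the join condition Pr(ed(R, S) ≤ κ) > τ concrete, the following is a minimal sketch of the naive possible-worlds baseline that the abstract describes as prohibitively expensive. It is not the filtering, indexing, or trie-based verification proposed in the paper. The representation of an uncertain string as a list of per-position character distributions, and all function names, are assumptions made for illustration only.

```python
from itertools import product


def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def possible_worlds(u):
    """Yield (string, probability) for every instance of an uncertain string.

    Assumption: `u` is a list of dicts, one per position, mapping each
    candidate character to its probability, with positions independent.
    """
    for combo in product(*(pos.items() for pos in u)):
        prob = 1.0
        for _, q in combo:
            prob *= q
        yield "".join(c for c, _ in combo), prob


def prob_within(r, s, kappa):
    """Pr(ed(R, S) <= kappa), computed by enumerating all pairs of worlds."""
    return sum(pr * ps
               for wr, pr in possible_worlds(r)
               for ws, ps in possible_worlds(s)
               if edit_distance(wr, ws) <= kappa)


def naive_join(coll_r, coll_s, kappa, tau):
    """Return index pairs (i, j) whose match probability exceeds tau."""
    return [(i, j)
            for i, r in enumerate(coll_r)
            for j, s in enumerate(coll_s)
            if prob_within(r, s, kappa) > tau]
```

The number of worlds enumerated grows as the product of the number of alternatives at each position, for both strings of every pair, which is the exponential cost that motivates bounding the probability without instantiating possible worlds.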
Publication Source (Journal or Book title)
Proceedings of the ACM SIGMOD International Conference on Management of Data
First Page
1471
Last Page
1482
Recommended Citation
Patil, M., & Shah, R. (2014). Similarity joins for uncertain strings. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1471-1482. https://doi.org/10.1145/2588555.2612178