SilkMoth: An Efficient Method for Finding Related Sets with Maximum Matching Constraints
April 16, 2017 Β· Declared Dead Β· π Proceedings of the VLDB Endowment
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Dong Deng, Albert Kim, Samuel Madden, Michael Stonebraker
arXiv ID
1704.04738
Category
cs.DB: Databases
Citations
43
Venue
Proceedings of the VLDB Endowment
Last Checked
3 months ago
Abstract
Determining if two sets are related - that is, if they have similar values or if one set contains the other - is an important problem with many applications in data cleaning, data integration, and information retrieval. A particularly popular metric that has been proposed is to measure the relatedness of two sets by treating the elements as vertices of a bipartite graph and calculating the score of the maximum matching pairing between elements. Compared to other metrics which require exact matchings between elements, this metric uses a similarity function to compare elements between the two sets, making it robust to small dissimilarities in elements and more useful for real-world, dirty data. Unfortunately, the metric suffers from expensive computational cost, taking O(n^3) time, where n is the number of elements in sets, for each set-to-set comparison. Thus for applications which try to search for all pairings of related sets in a brute-force manner, the runtime becomes unacceptably large. To address this challenge, we developed SilkMoth, a system capable of rapidly discovering related set pairs in collections of sets. Internally, SilkMoth creates a signature for each set, with the property that any other set which is related must match the signature. SilkMoth then uses these signatures to prune the search space, so only sets which match the signatures are left as candidates. Finally, SilkMoth applies the maximum matching metric on remaining candidates to verify which of these candidates are truly related sets. Thus, a contribution of this paper is the characterization of the space of signatures which enable this property. We show that selecting the optimal signature in this space is NP-complete, and based on insights from the characterization of the space, we propose two novel filters which help to prune the candidates further before verification.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Databases
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Untangling Blockchain: A Data Processing View of Blockchain Systems
R.I.P.
π»
Ghosted
Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades
R.I.P.
π»
Ghosted
BLOCKBENCH: A Framework for Analyzing Private Blockchains
R.I.P.
π»
Ghosted
Data Synthesis based on Generative Adversarial Networks
R.I.P.
π»
Ghosted
HoloClean: Holistic Data Repairs with Probabilistic Inference
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted