Learning Over Dirty Data Without Cleaning
April 05, 2020 ยท Declared Dead ยท ๐ SIGMOD Conference
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Jose Picado, John Davis, Arash Termehchy, Ga Young Lee
arXiv ID
2004.02308
Category
cs.DB: Databases
Cross-listed
cs.LG
Citations
25
Venue
SIGMOD Conference
Last Checked
3 months ago
Abstract
Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions for a dirty database. We propose DLearn, a novel relational learning system that learns directly over dirty databases effectively and efficiently without any preprocessing. DLearn leverages database constraints to learn accurate relational models over inconsistent and heterogeneous data. Its learned models represent patterns over all possible clean instances of the data in a usable form. Our empirical study indicates that DLearn learns accurate models over large real-world databases efficiently.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Databases
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
The Case for Learned Index Structures
R.I.P.
๐ป
Ghosted
Untangling Blockchain: A Data Processing View of Blockchain Systems
R.I.P.
๐ป
Ghosted
Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades
R.I.P.
๐ป
Ghosted
BLOCKBENCH: A Framework for Analyzing Private Blockchains
R.I.P.
๐ป
Ghosted
Data Synthesis based on Generative Adversarial Networks
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted