Data Cleaning Using Large Language Models
October 21, 2024 ยท Declared Dead ยท ๐ 2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW)
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Shuo Zhang, Zezhou Huang, Eugene Wu
arXiv ID
2410.15547
Category
cs.DB: Databases
Citations
10
Venue
2025 IEEE 41st International Conference on Data Engineering Workshops (ICDEW)
Last Checked
3 months ago
Abstract
Data cleaning is a crucial yet challenging task in data analysis, often requiring significant manual effort. To automate data cleaning, previous systems have relied on statistical rules derived from erroneous data, resulting in low accuracy and recall. This work introduces Cocoon, a novel data cleaning system that leverages large language models for rules based on semantic understanding and combines them with statistical error detection. However, data cleaning is still too complex a task for current LLMs to handle in one shot. To address this, we introduce Cocoon, which decomposes complex cleaning tasks into manageable components in a workflow that mimics human cleaning processes. Our experiments show that Cocoon outperforms state-of-the-art data cleaning systems on standard benchmarks.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Databases
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
The Case for Learned Index Structures
R.I.P.
๐ป
Ghosted
Untangling Blockchain: A Data Processing View of Blockchain Systems
R.I.P.
๐ป
Ghosted
Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades
R.I.P.
๐ป
Ghosted
BLOCKBENCH: A Framework for Analyzing Private Blockchains
R.I.P.
๐ป
Ghosted
Data Synthesis based on Generative Adversarial Networks
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted