EntroGD: Scalable Generalized Deduplication for Efficient Direct Analytics on Compressed IoT Data
November 06, 2025 · Declared Dead · the IEEE INFOCOM 2026 Workshop on Fusion of Data
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Xiaobo Zhao, Daniel E. Lucani
arXiv ID
2511.04148
Category
cs.DB: Databases
Citations
0
Venue
the IEEE INFOCOM 2026 Workshop on Fusion of Data
Last Checked
3 months ago
Abstract
Massive data streams from IoT and cyber-physical systems must be processed under strict bandwidth, latency, and resource constraints. Generalized Deduplication (GD) is a promising lossless compression framework, as it supports random access and direct analytics on compressed data. However, existing GD algorithms exhibit quadratic complexity $\mathcal{O}(nd^{2})$, which limits their scalability for high-dimensional datasets. This paper proposes EntroGD, an entropy-guided GD framework that decouples analytical fidelity from compression efficiency to achieve linear complexity $\mathcal{O}(nd)$. EntroGD adopts a two-stage design, first constructing compact condensed samples to preserve information critical for analytics, and then applying entropy-based bit selection to maximize compression. Experiments on 18 IoT datasets show that EntroGD reduces configuration time by up to $53.5\times$ compared to state-of-the-art GD compressors. Moreover, by enabling analytics with access to only $2.6\%$ of the original data volume, EntroGD accelerates clustering by up to $31.6\times$ with negligible loss in accuracy. Overall, EntroGD provides a scalable and system-efficient solution for direct analytics on compressed IoT data.
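The abstract names two ideas without giving detail: splitting data so that duplicates can be removed losslessly (GD), and using entropy to decide which bits to deduplicate. The sketch below is a minimal illustration of that general idea, not the authors' algorithm. It assumes fixed-width integer samples, treats low-entropy bit positions as the shared "base" and the remaining bits as a per-sample "deviation", and uses hypothetical function names (`bit_entropies`, `select_base_mask`, `gd_compress`, `gd_get`) and made-up sensor readings. Computing per-bit entropy takes one pass over every bit of every sample, which is the kind of $\mathcal{O}(nd)$ cost the abstract refers to.

```python
import math
from typing import Dict, List, Tuple


def bit_entropies(samples: List[int], width: int) -> List[float]:
    """Empirical binary entropy of each bit position (0 = least significant)."""
    n = len(samples)
    entropies = []
    for pos in range(width):
        ones = sum((s >> pos) & 1 for s in samples)
        p = ones / n
        if p in (0.0, 1.0):
            entropies.append(0.0)  # constant bit carries no information
        else:
            entropies.append(-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return entropies


def select_base_mask(samples: List[int], width: int, num_base_bits: int) -> int:
    """Assumed heuristic: put the lowest-entropy bit positions into the base."""
    ent = bit_entropies(samples, width)
    order = sorted(range(width), key=lambda pos: ent[pos])
    mask = 0
    for pos in order[:num_base_bits]:
        mask |= 1 << pos
    return mask


def gd_compress(
    samples: List[int], width: int, base_mask: int
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Store each distinct base once; keep (base id, deviation) per sample."""
    dev_mask = ((1 << width) - 1) & ~base_mask
    base_ids: Dict[int, int] = {}
    bases: List[int] = []
    records: List[Tuple[int, int]] = []
    for s in samples:
        base = s & base_mask
        if base not in base_ids:
            base_ids[base] = len(bases)
            bases.append(base)
        records.append((base_ids[base], s & dev_mask))
    return bases, records


def gd_get(bases: List[int], records: List[Tuple[int, int]], i: int) -> int:
    """Random access: rebuild sample i without touching any other record."""
    base_id, deviation = records[i]
    return bases[base_id] | deviation


readings = [200, 201, 203, 202, 200, 199, 201, 202]  # made-up 8-bit sensor values
mask = select_base_mask(readings, width=8, num_base_bits=4)
bases, records = gd_compress(readings, width=8, base_mask=mask)
restored = [gd_get(bases, records, i) for i in range(len(readings))]
assert restored == readings  # the base/deviation split is lossless
```

In this toy input the four high-order bits never change, so all eight readings share a single base and only the deviations differ. Analytics that tolerate coarse values could, under this assumption, operate on the base table alone, which is one way to read the abstract's claim about working with a small fraction of the data.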
Similar Papers
In the same crypt: Databases
The Case for Learned Index Structures · R.I.P. · 👻 Ghosted
Untangling Blockchain: A Data Processing View of Blockchain Systems · R.I.P. · 👻 Ghosted
Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades · R.I.P. · 👻 Ghosted
BLOCKBENCH: A Framework for Analyzing Private Blockchains · R.I.P. · 👻 Ghosted
Data Synthesis based on Generative Adversarial Networks · R.I.P. · 👻 Ghosted
Died the same way: 👻 Ghosted
Language Models are Few-Shot Learners · R.I.P. · 👻 Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library · R.I.P. · 👻 Ghosted
XGBoost: A Scalable Tree Boosting System · R.I.P. · 👻 Ghosted