Information Theoretic Limits of Cardinality Estimation: Fisher Meets Shannon
July 16, 2020 ยท Declared Dead ยท ๐ Symposium on the Theory of Computing
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Seth Pettie, Dingyu Wang
arXiv ID
2007.08051
Category
cs.DS: Data Structures & Algorithms
Cross-listed
cs.IT
Citations
27
Venue
Symposium on the Theory of Computing
Last Checked
3 months ago
Abstract
Estimating the cardinality (number of distinct elements) of a large multiset is a classic problem in streaming and sketching. In this paper we study the intrinsic tradeoff between the space complexity of the sketch and its estimation error. We define a new measure of efficiency for data sketches called the Fisher-Shannon (FiSh) number $\mathcal{H}/\mathcal{I}$. It captures the tension between the limiting Shannon entropy ($\mathcal{H}$) of the sketch and its normalized Fisher information ($\mathcal{I}$) that characterizes the variance of a statistically efficient, asymptotically unbiased estimator. Our aim in introducing the FiSh-number is to build the mathematical machinery necessary to argue for precise optimality, rather than asymptotic optimality, up to large constant factors. Our results are as follows. [1] We prove that all base-$q$ variants of Flajolet and Martin's PCSA sketch have FiSh-number $H_0/I_0 \approx 1.98016$ and that every base-$q$ variant of HyperLogLog has FiSh-number worse than $H_0/I_0$, but that they tend to $H_0/I_0$ in the limit as $q\rightarrow \infty$. Here $H_0,I_0$ are precisely defined constants. [2] We describe a sketch called Fishmonger that is based on a smoothed, entropy-compressed variant of PCSA with a different estimator function. Fishmonger processes a multiset of $[U]$ such that at all times, w.h.p., its space is $(1+o(1))(H_0/I_0)m \approx 1.98m$ bits and its standard error is $1/\sqrt{m}$. For example, to achieve a 1% standard error, one needs a little more than 19,800 bits, or $\approx 2.42$ kilobytes. [3] Finally, we give circumstantial evidence that $H_0/I_0$ is the optimum FiSh-number of mergeable sketches for Cardinality Estimation. We define a natural subset of mergeable sketches called linearizable sketches and prove that no member of this class can beat $H_0/I_0$. The popular mergeable sketches are, in fact, also linearizable.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Data Structures & Algorithms
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
Relief-Based Feature Selection: Introduction and Review
R.I.P.
๐ป
Ghosted
Route Planning in Transportation Networks
R.I.P.
๐ป
Ghosted
Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration
R.I.P.
๐ป
Ghosted
Hierarchical Clustering: Objective Functions and Algorithms
R.I.P.
๐ป
Ghosted
Graph Isomorphism in Quasipolynomial Time
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted