Fast Similarity Sketching

April 14, 2017 Β· Declared Dead Β· πŸ› IEEE Annual Symposium on Foundations of Computer Science

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Søren Dahlgaard, Mathias Bæk Tejs Langhede, Jakob Bæk Tejs Houen, Mikkel Thorup arXiv ID 1704.04370 Category cs.DS: Data Structures & Algorithms Citations 33 Venue IEEE Annual Symposium on Foundations of Computer Science Last Checked 3 months ago
Abstract
We consider the $\textit{Similarity Sketching}$ problem: Given a universe $[u] = \{0,\ldots, u-1\}$ we want a random function $S$ mapping subsets $A\subseteq [u]$ into vectors $S(A)$ of size $t$, such that the Jaccard similarity $J(A,B) = |A\cap B|/|A\cup B|$ between sets $A$ and $B$ is preserved. More precisely, define $X_i = [S(A)[i] = S(B)[i]]$ and $X = \sum_{i\in [t]} X_i$. We want $E[X_i]=J(A,B)$, and we want $X$ to be strongly concentrated around $E[X] = t \cdot J(A,B)$ (i.e. Chernoff-style bounds). This is a fundamental problem which has found numerous applications in data mining, large-scale classification, computer vision, similarity search, etc. via the classic MinHash algorithm. The vectors $S(A)$ are also called $\textit{sketches}$. Strong concentration is critical, for often we want to sketch many sets $B_1,\ldots,B_n$ so that we later, for a query set $A$, can find (one of) the most similar $B_i$. It is then critical that no $B_i$ looks much more similar to $A$ due to errors in the sketch. The seminal $t\times\textit{MinHash}$ algorithm uses $t$ random hash functions $h_1,\ldots, h_t$, and stores $\left ( \min_{a\in A} h_1(A),\ldots, \min_{a\in A} h_t(A) \right )$ as the sketch of $A$. The main drawback of MinHash is, however, its $O(t\cdot |A|)$ running time, and finding a sketch with similar properties and faster running time has been the subject of several papers. (continued...)
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Data Structures & Algorithms

Died the same way β€” πŸ‘» Ghosted