Fast and Exact Similarity Search in less than a Blink of an Eye

November 26, 2024 · Declared Dead · 🏛 IEEE International Conference on Data Engineering

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Patrick Schäfer, Jakob Brand, Ulf Leser, Botao Peng, Themis Palpanas arXiv ID 2411.17483 Category cs.DB: Databases Citations 3 Venue IEEE International Conference on Data Engineering Last Checked 3 months ago

Abstract

Similarity search is a fundamental operation for analyzing data series (DS), which are ordered sequences of real values. To enhance efficiency, summarization techniques are employed that reduce the dimensionality of DS. SAX-based approaches are the state-of-the-art for exact similarity queries, but their performance degrades for high-frequency signals, such as noisy data, or for high-frequency DS. In this work, we present the SymbOlic Fourier Approximation index (SOFA), which implements fast, exact similarity queries. SOFA is based on two building blocks: a tree index (inspired by MESSI) and the SFA symbolic summarization. It makes use of a learned summarization method called Symbolic Fourier Approximation (SFA), which is based on the Fourier transform and utilizes a data-adaptive quantization of the frequency domain. To better capture relevant information in high-frequency signals, SFA selects the Fourier coefficients by highest variance, resulting in a larger value range, thus larger quantization bins. The tree index solution employed by SOFA makes use of the GEMINI-approach to answer exact similarity search queries using lower bounding distance measures, and an efficient SIMD implementation. We further propose a novel benchmark comprising $17$ diverse datasets, encompassing 1 billion DS. Our experimental results demonstrate that SOFA outperforms existing methods on exact similarity queries: it is up to 10 times faster than a parallel sequential scan, 3-4 times faster than FAISS, and 2 times faster on average than MESSI. For high-frequency datasets, we observe a remarkable 38-fold performance improvement.