Space-efficient detection of unusual words

August 12, 2015 Β· Declared Dead Β· πŸ› SPIRE

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Djamal Belazzougui, Fabio Cunial arXiv ID 1508.02968 Category cs.DS: Data Structures & Algorithms Citations 13 Venue SPIRE Last Checked 3 months ago
Abstract
Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(Οƒ^2\log^2 n)$ bits, where $n$ is the length of the string and $Οƒ$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $Οƒ$. We further improve the algorithm by removing its time dependency on $Οƒ$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that $\textit{do not occur}$ in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Data Structures & Algorithms

Died the same way β€” πŸ‘» Ghosted