Fast, Small, and Simple Document Listing on Repetitive Text Collections

February 20, 2019 · Declared Dead · 🏛 SPIRE

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Dustin Cobas, Gonzalo Navarro
arXiv ID: 1902.07599
Category: cs.DS (Data Structures & Algorithms)
Cross-listed: cs.IR, cs.IT
Citations: 10
Venue: SPIRE
Last Checked: 4 months ago

Abstract
Document listing on string collections is the task of finding all documents where a pattern appears. It is regarded as the most fundamental document retrieval problem, and is useful in various applications. Many of the fastest-growing string collections are composed of very similar documents, such as versioned code and document collections, genome repositories, etc. Plain pattern-matching indexes designed for repetitive text collections achieve orders-of-magnitude reductions in space. By contrast, there are few analogous indexes for document retrieval. In this paper we present a simple document listing index for repetitive string collections of total length $n$ that lists the $\mathit{ndoc}$ distinct documents where a pattern of length $m$ appears in time $\mathcal{O}(m + \mathit{ndoc} \cdot \log n)$. We exploit the repetitiveness of the document array (i.e., the suffix array coarsened to document identifiers) to grammar-compress it while precomputing the answers to nonterminals, and store them in grammar-compressed form as well. Our experimental results show that our index sharply outperforms existing alternatives in the space/time tradeoff map.
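
To make the terms in the abstract concrete, the following is a minimal, uncompressed sketch of document listing with a suffix array and a document array. It is not the index from the paper: there is no grammar compression and no precomputed answers, and the names `build_index` and `list_documents` are hypothetical, chosen only for illustration. It assumes patterns never contain the `"\x00"` separator.

```python
from bisect import bisect_left, bisect_right


def build_index(docs):
    """Build a naive suffix array and document array (illustration only)."""
    text = ""
    doc_of_pos = []  # doc_of_pos[i] = id of the document covering text position i
    for doc_id, s in enumerate(docs):
        text += s + "\x00"  # assumed separator, must not occur in patterns
        doc_of_pos.extend([doc_id] * (len(s) + 1))
    # Suffix array by direct sorting of suffixes: simple, but far from efficient.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Document array: the suffix array coarsened to document identifiers.
    da = [doc_of_pos[i] for i in sa]
    return text, sa, da


def list_documents(text, sa, da, pattern):
    """Return the sorted ids of the distinct documents containing pattern."""
    m = len(pattern)
    # Length-m prefixes of sorted suffixes are themselves in sorted order,
    # so the matching suffixes form one contiguous range found by bisection.
    prefixes = [text[i:i + m] for i in sa]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    # Every occurrence is scanned here; the paper's index avoids this cost.
    return sorted(set(da[lo:hi]))


text, sa, da = build_index(["abracadabra", "abracadabro", "cadabra"])
print(list_documents(text, sa, da, "cadabra"))  # [0, 2]
```

The final step scans every occurrence of the pattern, so its cost grows with the number of occurrences rather than with the number of distinct documents. That gap is what a document listing index is meant to close.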
