Effective and Complete Discovery of Order Dependencies via Set-based Axiomatization

August 22, 2016 · Declared Dead · 🏛 Proceedings of the VLDB Endowment

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Jaroslaw Szlichta, Parke Godfrey, Lukasz Golab, Mehdi Kargar, Divesh Srivastava arXiv ID 1608.06169 Category cs.DB: Databases Citations 65 Venue Proceedings of the VLDB Endowment Last Checked 3 months ago

Abstract

Integrity constraints (ICs) provide a valuable tool for expressing and enforcing application semantics. However, formulating constraints manually requires domain expertise, is prone to human errors, and may be excessively time consuming, especially on large datasets. Hence, proposals for automatic discovery have been made for some classes of ICs, such as functional dependencies (FDs), and recently, order dependencies (ODs). ODs properly subsume FDs, as they can additionally express business rules involving order; e.g., an employee never has a higher salary while paying lower taxes compared with another employee. We address the limitations of prior work on OD discovery which has factorial complexity in the number of attributes, is incomplete (i.e., it does not discover valid ODs that cannot be inferred from the ones found) and is not concise (i.e., it can result in "redundant" discovery and overly large discovery sets). We improve significantly on complexity, offer completeness, and define a compact canonical form. This is based on a novel polynomial mapping to a canonical form for ODs, and a sound and complete set of axioms (inference rules) for canonical ODs. This allows us to develop an efficient set-containment, lattice-driven OD discovery algorithm that uses the inference rules to prune the search space. Our algorithm has exponential worst-case time complexity in the number of attributes and linear complexity in the number of tuples. We prove that it produces a complete, minimal set of ODs (i.e., minimal with regards to the canonical representation). Finally, using real and synthetic datasets, we experimentally show orders-of-magnitude performance improvements over the current state-of-the-art algorithm and demonstrate effectiveness of our techniques.