Aggregation Consistency Errors in Semantic Layers and How to Avoid Them

July 01, 2023 · Declared Dead · 🏛 HILDA@SIGMOD

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu
arXiv ID: 2307.00417
Category: cs.DB (Databases)
Cross-listed: cs.HC
Citations: 3
Venue: HILDA@SIGMOD
Last Checked: 3 months ago
Abstract
Analysts often struggle with analyzing data from multiple tables in a database due to their lack of knowledge on how to join and aggregate the data. To address this, data engineers pre-specify "semantic layers" which include the join conditions and "metrics" of interest with aggregation functions and expressions. However, joins can cause "aggregation consistency issues". For example, analysts may observe inflated total revenue caused by double counting from join fanouts. Existing BI tools rely on heuristics for deduplication, resulting in imprecise and challenging-to-understand outcomes. To overcome these challenges, we propose "weighing" as a core primitive to counteract join fanouts. "Weighing" has been used in various areas, such as market attribution and order management, ensuring metrics consistency (e.g., total revenue remains the same) even for many-to-many joins. The idea is to assign equal weight to each join key group (rather than each tuple) and then distribute the weights among tuples. Implementing weighing techniques necessitates user input; therefore, we recommend a human-in-the-loop framework that enables users to iteratively explore different strategies and visualize the results.
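The weighing idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (the listing has no code link); it is a minimal pure-Python example under stated assumptions: the `orders` and `touches` tables, their columns, and the helper names are all hypothetical, and an even split of each order's unit weight across its joined rows is only one of the strategies a user might choose.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical tables (not from the paper):
# orders:  (order_id, customer, revenue)
# touches: (customer, campaign), several campaigns may reach one customer
orders = [
    (1, "alice", 100),
    (2, "alice", 50),
    (3, "bob", 70),
]
touches = [
    ("alice", "email"),
    ("alice", "search"),
    ("bob", "email"),
]


def join_with_weights(orders, touches):
    """Inner join on customer; each order's unit weight is split evenly
    across the joined rows it fans out into (an assumed strategy)."""
    campaigns_by_customer = defaultdict(list)
    for customer, campaign in touches:
        campaigns_by_customer[customer].append(campaign)

    rows = []
    for order_id, customer, revenue in orders:
        matches = campaigns_by_customer.get(customer, [])
        # Orders with no match are dropped by the inner join; preserving
        # their revenue would need a separate decision (e.g. an outer join).
        for campaign in matches:
            rows.append((order_id, campaign, revenue, Fraction(1, len(matches))))
    return rows


def revenue_by_campaign(rows, weighted):
    """Sum revenue per campaign, optionally scaling each row by its weight."""
    totals = defaultdict(Fraction)
    for _order_id, campaign, revenue, weight in rows:
        totals[campaign] += revenue * (weight if weighted else 1)
    return dict(totals)


rows = join_with_weights(orders, touches)
true_total = sum(revenue for _, _, revenue in orders)
naive_total = sum(revenue for _, _, revenue, _ in rows)  # inflated by fanout
weighted_total = sum(revenue * weight for _, _, revenue, weight in rows)

print(true_total, naive_total, weighted_total)
print(revenue_by_campaign(rows, weighted=True))
```

In this toy data the unweighted sum over the joined rows double counts both of alice's orders, while the weighted sum matches the total of the `orders` table, and the per-campaign breakdown still adds up to that total. Exact fractions are used so the weights sum to one without floating-point drift.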

📜 Similar Papers

In the same crypt - Databases

R.I.P. 👻 Ghosted

Datasheets for Datasets

Timnit Gebru, Jamie Morgenstern, ... (+5 more)

cs.DB 🏛 CACM 📚 2.6K cites 8 years ago

Died the same way - 👻 Ghosted