CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

April 12, 2018 · Declared Dead · 🏛 Proceedings of the VLDB Endowment

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar
arXiv ID: 1804.04635
Category: cs.AI (Artificial Intelligence)
Cross-listed: cs.IR
Citations: 68
Venue: Proceedings of the VLDB Endowment
Last Checked: 3 months ago

Abstract

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a web page and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.
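Since no code accompanies the paper, the following is only a minimal sketch of the general idea the abstract describes: generating noisy training labels by aligning knowledge-base triples with the text nodes of a page. It is not the authors' algorithm. The function names (`normalize`, `generate_labels`), the exact-string matching, the overlap-based choice of a page's topic entity, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so KB values and page text compare cleanly."""
    return " ".join(text.lower().split())


def generate_labels(kb_triples, page_nodes):
    """Toy distant-supervision labeler (hypothetical, not the paper's method).

    kb_triples: iterable of (subject, predicate, object) strings.
    page_nodes: list of (xpath, text) pairs for the text nodes of one page.
    Returns (topic_subject, labels), where labels is a list of
    (xpath, predicate, text) tuples usable as noisy training examples.
    """
    # Index the KB: subject -> normalized object -> set of predicates.
    by_subject = defaultdict(lambda: defaultdict(set))
    for subj, pred, obj in kb_triples:
        by_subject[subj][normalize(obj)].add(pred)

    page_texts = {normalize(text) for _, text in page_nodes}

    # Assumed heuristic: the page topic is the KB subject whose name appears
    # on the page and whose known objects overlap most with the page text.
    best_subject, best_overlap = None, 0
    for subj, objects in by_subject.items():
        if normalize(subj) not in page_texts:
            continue
        overlap = sum(1 for obj in objects if obj in page_texts)
        if overlap > best_overlap:
            best_subject, best_overlap = subj, overlap

    if best_subject is None:
        return None, []

    # Label every node whose text equals a known object of the topic entity.
    labels = []
    for xpath, text in page_nodes:
        for pred in sorted(by_subject[best_subject].get(normalize(text), ())):
            labels.append((xpath, pred, text))
    return best_subject, labels


# Toy data, invented for this example.
kb = [
    ("Do the Right Thing", "directed_by", "Spike Lee"),
    ("Do the Right Thing", "release_year", "1989"),
    ("Crooklyn", "directed_by", "Spike Lee"),
]
nodes = [
    ("/html/body/h1", "Do the Right Thing"),
    ("/html/body/div[1]/span", "Spike Lee"),
    ("/html/body/div[2]/span", "1989"),
    ("/html/body/div[3]/span", "120 min"),
]
topic, labels = generate_labels(kb, nodes)
```

In a full pipeline, the labeled nodes would become training examples for a classifier over DOM features. Labels produced this way are noisy and incomplete, as the abstract notes: a value that appears in several places on a page, or a fact missing from the knowledge base, yields wrong or absent labels.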