Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [Experiment, Analysis & Benchmark]
April 24, 2023 ยท Declared Dead ยท ๐ Proceedings of the VLDB Endowment
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Alexandros Zeakis, George Papadakis, Dimitrios Skoutas, Manolis Koubarakis
arXiv ID
2304.12329
Category
cs.DB: Databases
Cross-listed
cs.CL
Citations
61
Venue
Proceedings of the VLDB Endowment
Last Checked
3 months ago
Abstract
Many recent works on Entity Resolution (ER) leverage Deep Learning techniques involving language models to improve effectiveness. This is applied to both main steps of ER, i.e., blocking and matching. Several pre-trained embeddings have been tested, with the most popular ones being fastText and variants of the BERT model. However, there is no detailed analysis of their pros and cons. To cover this gap, we perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets. First, we assess their vectorization overhead for converting all input entities into dense embeddings vectors. Second, we investigate their blocking performance, performing a detailed scalability analysis, and comparing them with the state-of-the-art deep learning-based blocking method. Third, we conclude with their relative performance for both supervised and unsupervised matching. Our experimental results provide novel insights into the strengths and weaknesses of the main language models, facilitating researchers and practitioners to select the most suitable ones in practice.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Databases
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
The Case for Learned Index Structures
R.I.P.
๐ป
Ghosted
Untangling Blockchain: A Data Processing View of Blockchain Systems
R.I.P.
๐ป
Ghosted
Converting Static Image Datasets to Spiking Neuromorphic Datasets Using Saccades
R.I.P.
๐ป
Ghosted
BLOCKBENCH: A Framework for Analyzing Private Blockchains
R.I.P.
๐ป
Ghosted
Data Synthesis based on Generative Adversarial Networks
Died the same way โ ๐ป Ghosted
R.I.P.
๐ป
Ghosted
Language Models are Few-Shot Learners
R.I.P.
๐ป
Ghosted
PyTorch: An Imperative Style, High-Performance Deep Learning Library
R.I.P.
๐ป
Ghosted
XGBoost: A Scalable Tree Boosting System
R.I.P.
๐ป
Ghosted