Systematic Evaluation of Neural Retrieval Models on the Touché 2020 Argument Retrieval Subset of BEIR
July 10, 2024 · Declared Dead · Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Authors
Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin
arXiv ID
2407.07790
Category
cs.IR: Information Retrieval
Citations
16
Venue
Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Repository
https://github.com/castorini/touche-error-analysis
Last Checked
1 month ago
Abstract
The zero-shot effectiveness of neural retrieval models is often evaluated on the BEIR benchmark -- a combination of different IR evaluation datasets. Interestingly, previous studies found that particularly on the BEIR subset Touché 2020, an argument retrieval task, neural retrieval models are considerably less effective than BM25. Still, so far, no further investigation has been conducted on what makes argument retrieval so "special". To more deeply analyze the respective potential limits of neural retrieval models, we run a reproducibility study on the Touché 2020 data. In our study, we focus on two experiments: (i) a black-box evaluation (i.e., no model retraining), incorporating a theoretical exploration using retrieval axioms, and (ii) a data denoising evaluation involving post-hoc relevance judgments. Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data. As many of the short Touché passages are not argumentative and thus non-relevant per se, and as the missing judgments complicate fair comparison, we denoise the Touché 2020 data by excluding very short passages (less than 20 words) and by augmenting the unjudged data with post-hoc judgments following the Touché guidelines. On the denoised data, the effectiveness of the neural models improves by up to 0.52 in nDCG@10, but BM25 is still more effective. Our code and the augmented Touché 2020 dataset are available at https://github.com/castorini/touche-error-analysis.
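The denoising step and the metric named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the dictionary layout of the corpus and relevance judgments, whitespace-based word counting, and the linear-gain nDCG variant are all assumptions made here for illustration. Only the 20-word threshold and the use of nDCG@10 come from the abstract.

```python
import math

MIN_WORDS = 20  # threshold from the abstract; counting words by whitespace is an assumption


def drop_short_passages(corpus, min_words=MIN_WORDS):
    """Return only the passages that contain at least `min_words` words.

    `corpus` is a hypothetical mapping of doc_id -> passage text.
    """
    return {
        doc_id: text
        for doc_id, text in corpus.items()
        if len(text.split()) >= min_words
    }


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """Compute nDCG@k for one query using linear gain.

    `qrels` is a hypothetical mapping of doc_id -> graded relevance.
    Unjudged documents are treated as non-relevant (gain 0), which is
    why missing judgments can penalize a model's results.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

In this sketch a ranking would be filtered to the surviving passage IDs before scoring, and post-hoc judgments would simply be added as new entries in `qrels`.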