ManufactuBERT: Efficient Continual Pretraining for Manufacturing

November 07, 2025 Β· Declared Dead Β· πŸ› arXiv.org

πŸ’€ CAUSE OF DEATH: 404 Not Found
Code link is broken/dead
Authors Robin Armingaud, Romaric BesanΓ§on arXiv ID 2511.05135 Category cs.CL: Computation & Language Citations 0 Venue arXiv.org Repository https://huggingface.co/cea-list-ia Last Checked 2 months ago
Abstract
While large general-purpose Transformer-based encoders excel at general language understanding, their performance diminishes in specialized domains like manufacturing due to a lack of exposure to domain-specific terminology and semantics. In this paper, we address this gap by introducing ManufactuBERT, a RoBERTa model continually pretrained on a large-scale corpus curated for the manufacturing domain. We present a comprehensive data processing pipeline to create this corpus from web data, involving an initial domain-specific filtering step followed by a multi-stage deduplication process that removes redundancies. Our experiments show that ManufactuBERT establishes a new state-of-the-art on a range of manufacturing-related NLP tasks, outperforming strong specialized baselines. More importantly, we demonstrate that training on our carefully deduplicated corpus significantly accelerates convergence, leading to a 33\% reduction in training time and computational cost compared to training on the non-deduplicated dataset. The proposed pipeline offers a reproducible example for developing high-performing encoders in other specialized domains. We will release our model and curated corpus at https://huggingface.co/cea-list-ia.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Computation & Language

πŸŒ… πŸŒ… Old Age

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, ... (+6 more)

cs.CL πŸ› NeurIPS πŸ“š 166.0K cites 8 years ago

Died the same way β€” πŸ’€ 404 Not Found