Optimizing Adaptive Attacks against Watermarks for Language Models

October 03, 2024 · Declared Dead · 🏛 International Conference on Machine Learning

📜 CAUSE OF DEATH: Death by README
Repo has only a README

Repo contents: LICENSE, README.md

Authors: Abdulrahman Diaa, Toluwani Aremu, Nils Lukas
arXiv ID: 2410.02440
Category: cs.CR: Cryptography & Security
Cross-listed: cs.AI
Citations: 4
Venue: International Conference on Machine Learning
Repository: https://github.com/nilslukas/ada-wm-evasion ⭐ 3
Last Checked: 1 month ago
Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively optimized paraphrasers at https://github.com/nilslukas/ada-wm-evasion.
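The abstract's recipe, scoring paraphrases by how well they evade the target watermark detector while preserving quality and then tuning the paraphraser with preference-based optimization, can be sketched roughly as below. This is a minimal, hypothetical illustration rather than the authors' released code: `paraphraser`, `detector`, and `quality` are assumed stand-in callables, and the DPO-style loss is just one common choice of preference objective.

```python
# Hypothetical sketch of the adaptive attack idea from the abstract:
# build preference pairs of paraphrases (preferring the one that better
# evades the watermark detector while keeping quality), then tune the
# paraphraser with a DPO-style preference-optimization loss.
import torch.nn.functional as F


def build_preference_pair(text, paraphraser, detector, quality):
    """Sample two paraphrases of a watermarked text and rank them.

    `paraphraser.sample`, `detector`, and `quality` are placeholder
    callables: detector returns a watermark detection score (lower is
    better for the attacker), quality returns a text-quality score.
    """
    a, b = paraphraser.sample(text), paraphraser.sample(text)

    def evasion(t):
        # Reward high quality and low watermark detectability.
        return quality(t) - detector(t)

    return (a, b) if evasion(a) >= evasion(b) else (b, a)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective on (winner, loser) sequence log-probs under
    the tuned paraphraser (logp_*) and a frozen reference model (ref_logp_*)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

Preference pairs built this way can be fed to standard preference-optimization tooling, steering the paraphraser against that particular detector, which is what the abstract means by tuning attacks "against the specific watermarking method".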
Community shame:
Not yet rated

📜 Similar Papers

In the same crypt – Cryptography & Security

Died the same way – 📜 Death by README