Robust Detection of Watermarks for Large Language Models Under Human Edits

November 21, 2024 · Declared Dead · 🏛 Journal of the Royal Statistical Society: Series B (Statistical Methodology)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Xiang Li, Feng Ruan, Huiyuan Wang, Qi Long, Weijie J. Su
arXiv ID: 2411.13868
Category: stat.ME
Cross-listed: cs.CL, cs.LG, math.ST, stat.ML
Citations: 18
Venue: Journal of the Royal Statistical Society: Series B (Statistical Methodology)
Last Checked: 1 month ago
Abstract
Watermarking has offered an effective approach to distinguishing text generated by large language models (LLMs) from human-written text. However, the pervasive presence of human edits on LLM-generated text dilutes watermark signals, thereby significantly degrading detection performance of existing methods. In this paper, by modeling human edits through mixture model detection, we introduce a new method in the form of a truncated goodness-of-fit test for detecting watermarked text under human edits, which we refer to as Tr-GoF. We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark in a certain asymptotic regime of substantial text modifications and vanishing watermark signals. Importantly, Tr-GoF achieves this optimality *adaptively* as it does not require precise knowledge of human edit levels or probabilistic specifications of the LLMs, in contrast to the optimal but impractical (Neyman–Pearson) likelihood ratio test. Moreover, we establish that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications. In stark contrast, we show that sum-based detection rules, as employed by existing methods, fail to achieve optimal robustness in both regimes because the additive nature of their statistics is less resilient to edit-induced noise. Finally, we demonstrate the competitive and sometimes superior empirical performance of the Tr-GoF test on both synthetic data and open-source LLMs in the OPT and LLaMA families.
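The paper ships no code, but the abstract's recipe is concrete enough to sketch. Assuming the standard Gumbel-max pivot (Y_t is the decoder's pseudorandom uniform at the emitted token, recomputable from the hash key, so p_t = 1 - Y_t is Uniform(0,1) on human text and piles up near 0 on watermarked text), the sketch below scans the sorted p-values with a one-sided goodness-of-fit criterion truncated at a level delta, so a handful of extreme p-values cannot dominate. The function names, the delta = 0.05 truncation rule, the s = 2 (higher-criticism-style) member of the goodness-of-fit family, and the Beta(1, 8) / 30%-retained mixture used to mimic heavy edits are all illustrative assumptions here, not the authors' implementation:

```python
import numpy as np

def pivotal_pvalues(token_uniforms):
    """Convert Gumbel-max pivotal statistics Y_t into p-values p_t = 1 - Y_t.

    Under human-written (null) text, Y_t ~ Uniform(0, 1), so p_t is uniform;
    under the watermark, Y_t is stochastically larger, so p_t skews small.
    """
    return 1.0 - np.asarray(token_uniforms, dtype=float)

def tr_gof_statistic(pvals, delta=0.05):
    """Truncated one-sided goodness-of-fit scan over sorted p-values.

    Illustrative higher-criticism-style (s = 2) member of the family;
    p-values below the truncation level `delta` are discarded so that a
    few extreme values cannot dominate the statistic.
    """
    p = np.sort(np.asarray(pvals, dtype=float))
    n = p.size
    k = np.arange(1, n + 1)
    keep = (p >= delta) & (p < 1.0)   # truncation step
    if not keep.any():
        return 0.0
    emp = k[keep] / n                 # empirical CDF evaluated at p_(k)
    t = p[keep]
    # Large positive values: more small p-values than Uniform(0,1) allows.
    hc = np.sqrt(n) * (emp - t) / np.sqrt(t * (1.0 - t))
    return float(hc.max())

# Toy check: human text vs. a watermark diluted by heavy edits, modeled as a
# mixture in which only ~30% of tokens retain the watermark signal.
rng = np.random.default_rng(0)
null_p = pivotal_pvalues(rng.uniform(size=500))
edited = rng.uniform(size=500) < 0.3
wm_p = np.where(edited, rng.beta(1.0, 8.0, size=500), rng.uniform(size=500))
print(tr_gof_statistic(null_p))   # small
print(tr_gof_statistic(wm_p))     # noticeably larger
```

In practice the rejection threshold would be calibrated by Monte Carlo under the null (recomputing the statistic on fresh uniforms), and both the truncation level and the family index s are tuning choices the paper analyzes; the values above are placeholders.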
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

📜 Similar Papers

In the same crypt — stat.ME

Died the same way — 👻 Ghosted