Exploring Backdoor Vulnerabilities of Chat Models
April 03, 2024 ยท Entered Twilight ยท ๐ International Conference on Computational Linguistics
Repo contents: .gitignore, Instructional_Model_Backdoor, LICENSE, README.md, assets, configs, docker, docs, fastchat, format.sh, img, playground, pyproject.toml, requirements.txt, scripts, tests
Authors
Yunzhuo Hao, Wenkai Yang, Yankai Lin
arXiv ID
2404.02406
Category
cs.CR: Cryptography & Security
Cross-listed
cs.AI,
cs.CL
Citations
17
Venue
International Conference on Computational Linguistics
Repository
https://github.com/hychaochao/Chat-Models-Backdoor-Attacking
โญ 18
Last Checked
1 month ago
Abstract
Recent researches have shown that Large Language Models (LLMs) are susceptible to a security threat known as Backdoor Attack. The backdoored model will behave well in normal cases but exhibit malicious behaviours on inputs inserted with a specific backdoor trigger. Current backdoor studies on LLMs predominantly focus on instruction-tuned LLMs, while neglecting another realistic scenario where LLMs are fine-tuned on multi-turn conversational data to be chat models. Chat models are extensively adopted across various real-world scenarios, thus the security of chat models deserves increasing attention. Unfortunately, we point out that the flexible multi-turn interaction format instead increases the flexibility of trigger designs and amplifies the vulnerability of chat models to backdoor attacks. In this work, we reveal and achieve a novel backdoor attacking method on chat models by distributing multiple trigger scenarios across user inputs in different rounds, and making the backdoor be triggered only when all trigger scenarios have appeared in the historical conversations. Experimental results demonstrate that our method can achieve high attack success rates (e.g., over 90% ASR on Vicuna-7B) while successfully maintaining the normal capabilities of chat models on providing helpful responses to benign user requests. Also, the backdoor can not be easily removed by the downstream re-alignment, highlighting the importance of continued research and attention to the security concerns of chat models. Warning: This paper may contain toxic content.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Cryptography & Security
R.I.P.
๐ป
Ghosted
R.I.P.
๐ป
Ghosted
Membership Inference Attacks against Machine Learning Models
R.I.P.
๐ป
Ghosted
The Limitations of Deep Learning in Adversarial Settings
R.I.P.
๐ป
Ghosted
Practical Black-Box Attacks against Machine Learning
R.I.P.
๐ป
Ghosted
Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
R.I.P.
๐ป
Ghosted