Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
December 19, 2023 · Entered Twilight · Tiny Papers @ ICLR
Repo contents: LICENSE, LLAMA_2_LICENSE, README.md, attack_llama.py, attack_vicuna.py, data, environment.yml, few_shot_priming.py, generation_attack.py, get_accuracy.py, install.sh, llama_guard.py, llama_guard, notice.txt
Authors
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
arXiv ID
2312.12321
Category
cs.CR: Cryptography & Security
Cross-listed
cs.AI,
cs.CL,
cs.LG
Citations
42
Venue
Tiny Papers @ ICLR
Repository
https://github.com/uiuc-focal-lab/llm-priming-attacks
⭐ 17
Last Checked
1 month ago
Abstract
With the recent surge in popularity of LLMs has come an ever-increasing need for LLM safety training. In this paper, we investigate the fragility of SOTA open-source LLMs under simple, optimization-free attacks we refer to as $\textit{priming attacks}$, which are easy to execute and effectively bypass alignment from safety training. Our proposed attack improves the Attack Success Rate on Harmful Behaviors, as measured by Llama Guard, by up to $3.3\times$ compared to baselines. Source code and data are available at https://github.com/uiuc-focal-lab/llm-priming-attacks.
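For context, a priming attack simply pre-fills the beginning of the model's response with an affirmative prefix, so that greedy continuation proceeds from an already-compliant answer rather than from a refusal; no prompt optimization is needed. Below is a minimal sketch of this idea, assuming a Hugging Face transformers Llama-2-style chat model. The model name, request string, and priming prefix are illustrative assumptions, not the paper's exact prompts or data; see attack_llama.py in the repository for the authors' actual implementation.

# Minimal priming-attack sketch (illustrative; not the paper's exact setup).
# Assumes a Hugging Face transformers causal LM with a Llama-2 chat format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed model; any chat LLM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

request = "[illustrative harmful behavior request]"  # placeholder, not real data
# Priming: pre-fill the start of the assistant's answer so the model
# continues an already-compliant response instead of refusing.
priming_prefix = "Sure, here is how to"

# Build the prompt manually so generation starts *inside* the assistant
# turn, immediately after the priming prefix (Llama-2 [INST] format).
prompt = f"[INST] {request} [/INST] {priming_prefix}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, then re-attach the prefix.
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(priming_prefix + completion)

In this framing, the attack's cost is a single forward generation pass per behavior, which is what makes it "easy to execute" compared to optimization-based jailbreaks.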
Similar Papers
In the same crypt · Cryptography & Security
Membership Inference Attacks against Machine Learning Models · R.I.P. 👻 Ghosted
The Limitations of Deep Learning in Adversarial Settings · R.I.P. 👻 Ghosted
Practical Black-Box Attacks against Machine Learning · R.I.P. 👻 Ghosted
Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks · R.I.P. 👻 Ghosted