CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
July 08, 2025 · Declared Dead · 🏛 Annual Meeting of the Association for Computational Linguistics
Authors
Xiaohu Li, Yunfeng Ning, Zepeng Bao, Mayi Xu, Jianhao Chen, Tieyun Qian
arXiv ID
2507.06043
Category
cs.CR: Cryptography & Security
Cross-listed
cs.AI
Citations
3
Venue
Annual Meeting of the Association for Computational Linguistics
Repository
https://github.com/NLPGM/CAVGAN
⭐ 1
Last Checked
1 month ago
Abstract
Security alignment enables a Large Language Model (LLM) to gain protection against malicious queries, but various jailbreak attack methods reveal the vulnerability of this security mechanism. Previous studies have treated LLM jailbreak attacks and defenses in isolation. We analyze the security protection mechanism of the LLM and propose a framework that combines attack and defense. Our method builds on the linear separability of LLM intermediate-layer embeddings, as well as on the essence of a jailbreak attack, which aims to embed harmful queries and shift them into the safe region. We utilize a generative adversarial network (GAN) to learn the security judgment boundary inside the LLM, enabling efficient jailbreak attack and defense. The experimental results indicate that our method achieves an average jailbreak success rate of 88.85% across three popular LLMs, while the defense success rate on the state-of-the-art jailbreak dataset reaches an average of 84.17%. This not only validates the effectiveness of our approach but also sheds light on the internal security mechanisms of LLMs, offering new insights for enhancing model security. The code and data are available at https://github.com/NLPGM/CAVGAN.
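For intuition, here is a minimal sketch (in PyTorch, not the authors' released code) of the adversarial setup the abstract describes: a discriminator approximates the LLM's internal safe/harmful boundary over intermediate-layer hidden states, while a generator learns a perturbation that moves harmful embeddings across that boundary. All names, dimensions, and the stand-in random data below are illustrative assumptions; in practice the inputs would be hidden states extracted from an intermediate layer of the target LLM for safe and harmful prompts.

# Minimal sketch of a GAN learning a safety boundary over LLM hidden states.
# Not the authors' implementation; all names and dimensions are assumptions.
import torch
import torch.nn as nn

HIDDEN_DIM = 4096  # hypothetical hidden size of the target LLM layer

class Discriminator(nn.Module):
    """Approximates the LLM's internal safety boundary: safe vs. harmful."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, 1),  # logit: >0 means "judged safe"
        )
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h)

class Generator(nn.Module):
    """Learns an additive perturbation that shifts a harmful embedding
    toward the safe side of the learned boundary (the attack direction)."""
    def __init__(self, dim: int = HIDDEN_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 512), nn.ReLU(),
            nn.Linear(512, dim), nn.Tanh(),  # bounded perturbation
        )
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.net(h)  # perturbed embedding

def train_step(gen, disc, safe_h, harmful_h, g_opt, d_opt):
    bce = nn.BCEWithLogitsLoss()
    # Discriminator step: separate safe from (perturbed) harmful embeddings.
    d_opt.zero_grad()
    d_loss = (bce(disc(safe_h), torch.ones(len(safe_h), 1))
              + bce(disc(gen(harmful_h).detach()), torch.zeros(len(harmful_h), 1)))
    d_loss.backward()
    d_opt.step()
    # Generator step: make perturbed harmful embeddings look "safe".
    g_opt.zero_grad()
    g_loss = bce(disc(gen(harmful_h)), torch.ones(len(harmful_h), 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    gen, disc = Generator(), Discriminator()
    g_opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
    d_opt = torch.optim.Adam(disc.parameters(), lr=1e-4)
    # Stand-in data: two linearly separable clusters playing the role of
    # safe vs. harmful hidden states (real data would come from the LLM).
    safe_h = torch.randn(32, HIDDEN_DIM) + 1.0
    harmful_h = torch.randn(32, HIDDEN_DIM) - 1.0
    for step in range(100):
        d_loss, g_loss = train_step(gen, disc, safe_h, harmful_h, g_opt, d_opt)
    print(f"final d_loss={d_loss:.3f} g_loss={g_loss:.3f}")

The same learned boundary supports the defensive use the abstract mentions: once the discriminator approximates the model's safety judgment, an incoming query whose embedding it scores as harmful (or as suspiciously shifted toward the safe side) can be flagged or refused.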
📜 Similar Papers
In the same crypt — Cryptography & Security
👻 Ghosted · Membership Inference Attacks against Machine Learning Models
👻 Ghosted · The Limitations of Deep Learning in Adversarial Settings
👻 Ghosted · Practical Black-Box Attacks against Machine Learning
👻 Ghosted · Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks
👻 Ghosted · Extracting Training Data from Large Language Models
Died the same way — ⚰️ The Empty Tomb
⚰️ The Empty Tomb · DSFD: Dual Shot Face Detector
⚰️ The Empty Tomb · InstanceCut: from Edges to Instances with MultiCut
⚰️ The Empty Tomb · FLNet: Landmark Driven Fetching and Learning Network for Faithful Talking Facial Animation Synthesis