Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

December 12, 2023 · Declared Dead

💀 CAUSE OF DEATH: 404 Not Found
Code link is broken/dead
Authors: Yimo Deng, Huangxun Chen
arXiv ID: 2312.07130
Category: cs.AI (Artificial Intelligence)
Citations: 5
Repository: https://github.com/researchcode003/DACA
Last checked: 1 month ago
Abstract
To prevent Text-to-Image (T2I) models from generating unethical images, providers deploy safety filters that block inappropriate drawing prompts. Previous works have employed token replacement to search for adversarial prompts that bypass these filters, but such attacks have become ineffective because nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of its individual visual components yields an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to automatically perform this rephrasing. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at [this link](https://github.com/researchcode003/DACA).
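The core idea the abstract describes, splitting a drawing intent into visual components and rephrasing each one benignly before recombining them, can be illustrated with a very rough sketch. This is not the authors' implementation: the element list and the `toy_describe` stand-in for the LLM "describer" agent are hypothetical placeholders for illustration only.

```python
from typing import Callable, List

def build_adversarial_prompt(
    intent_elements: List[str],
    describe: Callable[[str], str],
) -> str:
    """Rephrase each visual element of a drawing intent into a benign
    description, then stitch the descriptions into a single prompt.
    This mirrors the divide-and-conquer idea only at a high level."""
    descriptions = [describe(element) for element in intent_elements]
    return " ".join(descriptions)

# Hypothetical stand-in for the LLM agent: in the actual pipeline an
# LLM would produce a filter-safe paraphrase of each element.
def toy_describe(element: str) -> str:
    return f"a scene featuring {element},"

prompt = build_adversarial_prompt(["a person", "an object"], toy_describe)
```

In the paper's setting, the decomposition and the per-element paraphrasing are themselves driven by LLM agents rather than a fixed template, which is what lets the combined prompt pass the target model's semantic checks.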

📜 Similar Papers

In the same crypt — Artificial Intelligence

Died the same way — 💀 404 Not Found