Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

September 26, 2024 · Declared Dead · 🏛 arXiv.org

Repo contents: .gitignore, README.md, finetuning_stage_illustration.png, harmful_finetuning.png, harmful_finetuning.pptx, illustration.png, image.png, nohup.out, organization.png, survey_slide.pdf

Authors: Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
arXiv ID: 2409.18169
Category: cs.CR (Cryptography & Security)
Cross-listed: cs.AI, cs.LG
Citations: 82
Venue: arXiv.org
Repository: https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers (⭐ 232)
Last Checked: 1 month ago

Abstract

Recent research demonstrates that the nascent fine-tuning-as-a-service business model exposes serious safety concerns: fine-tuning on a small amount of harmful data uploaded by users can compromise the safety alignment of the model. The attack, known as the harmful fine-tuning attack, has attracted broad research interest in the community. However, as the attack is still new, we observe that there are general misunderstandings within the research community. To clear up these misunderstandings, this paper provides a comprehensive overview of three aspects of harmful fine-tuning: attack settings, defense design, and evaluation methodology. Specifically, we first present the threat model of the problem and introduce the harmful fine-tuning attack and its variants. Then we systematically survey the existing literature on attacks, defenses, and mechanical analysis of the problem. Finally, we introduce the evaluation methodology and outline future research directions that might contribute to the development of the field. Additionally, we present a list of questions of interest, which might be useful to refer to when reviewers in the peer review process question the realism of the experiment, attack, or defense setting. A curated list of relevant papers is maintained and made accessible at: https://github.com/git-disl/awesome_LLM-harmful-fine-tuning-papers.