Cite this article:
- Yu Miao, Sun Lei, Hu Cuiyun, Zang Weifei, Guo Song, Hu Peng. Safety Classification Fine-tuning: A fine-tuning method to improve the output content safety of LLMs [J]. Journal of Cyber Security, accepted.
Abstract:
Instruction-tuned models are widely used across many domains and tasks because of their excellent ability to understand and follow instructions. However, this ability can also be exploited maliciously to induce the model to generate harmful content. Existing methods for improving the safety of the content output by instruction-tuned models still have shortcomings: safety fine-tuning can undermine the model's helpfulness and offers insufficient defense against jailbreak attacks, while filtering content with a pre-trained content-moderation model slows down the model's responses. To address these problems, this paper proposes a new fine-tuning method, SCFT (Safety Classification Fine-tuning). The motivation is our observation that instruction-tuned models are easy to misuse because they lack the ability to judge the safety of an "instruction-response" pair, whereas the embedding vector of the sentence-ending EOS token in the hidden states output by the model's final decoder layer contains the semantic information of the whole sentence and is well suited to judging its safety; the model's basic architecture, however, gives it no classification capability. We therefore add a new classification head to the model's output layer and, while instruction-tuning the model for general capabilities, train this head to classify sentences as "safe" or "unsafe" based on their semantic information. The trained head becomes the model's internal mechanism for controlling the safety of its output, the "discrimination mechanism", which allows the fine-tuned model to actively judge the safety of instructions and responses at inference time and to block the output of unsafe content. Further analysis shows that, through this "discrimination mechanism", SCFT unifies the training objectives for helpfulness and safety and achieves a better balance between them, keeps the knowledge underlying the model's general and safety capabilities symmetric, and extends the safety training data to the pre-training data distribution, strengthening the robustness of the model's safety capability. Experimental results show that SCFT is a resource-efficient, end-to-end safety fine-tuning method: without additional computing resources and without harming the model's general capabilities, it reduces the fine-tuned model's harmful output rate by about 91%, lowers the average harmfulness score from over 4 to 1.36 (on a 5-point scale, where a higher score indicates a more harmful model), and achieves a 0% harmfulness rate under jailbreak attacks.
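As a concrete illustration of the mechanism described in the abstract, the following is a minimal PyTorch-style sketch of attaching a binary safety classification head to the final decoder layer's EOS hidden state and training it jointly with the ordinary instruction-tuning objective. It is not the authors' implementation: the class name SCFTModel, the attribute safety_head, the loss weight lambda_cls, and the way the EOS positions (eos_index) and safety labels are supplied are assumptions made for illustration.

```python
# Illustrative sketch only (not the paper's released code): a causal LM wrapped with
# an extra binary "safety" head over the EOS token's final-layer hidden state.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM


class SCFTModel(nn.Module):
    """Causal LM plus a 'safe'/'unsafe' classification head over the EOS hidden state."""

    def __init__(self, base_name: str, lambda_cls: float = 1.0):
        super().__init__()
        self.lm = AutoModelForCausalLM.from_pretrained(base_name)
        hidden_size = self.lm.config.hidden_size
        # Two logits for the labels "safe" (0) / "unsafe" (1).
        self.safety_head = nn.Linear(hidden_size, 2)
        self.lambda_cls = lambda_cls

    def forward(self, input_ids, attention_mask, labels, safety_labels, eos_index):
        # Standard instruction-tuning forward pass; `labels` yields the LM loss.
        out = self.lm(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1]                   # (batch, seq_len, hidden)
        # The EOS token's final-layer hidden state summarises the whole
        # "instruction-response" pair and feeds the classification head.
        batch_idx = torch.arange(last_hidden.size(0), device=last_hidden.device)
        eos_state = last_hidden[batch_idx, eos_index]         # (batch, hidden)
        cls_logits = self.safety_head(eos_state)              # (batch, 2)
        cls_loss = nn.functional.cross_entropy(cls_logits, safety_labels)
        # Joint objective: helpfulness (LM loss) plus safety (classification loss).
        loss = out.loss + self.lambda_cls * cls_loss
        return loss, cls_logits
```

Backpropagating the summed loss is one way the unified helpfulness/safety objective described in the abstract could be realised; the actual SCFT training setup may differ.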
Keywords: Large language models; Content safety; Instruction fine-tuning
DOI:
Received: 2024-07-21; Revised: 2024-11-08
Funding: Natural Science Foundation of Henan Province (No. 242300420699); Military Science and Technology Committee Project (JJK2023-449)
Safety Classification Fine-tuning: A fine-tuning method to improve the output content safety of LLMs
Yu Miao, Sun Lei, Hu Cuiyun, Zang Weifei, Guo Song, Hu Peng
(Information Engineering University)
Abstract:
Instruction-tuned models have been widely applied across various fields and tasks due to their excellent ability to understand and follow instructions. However, this capability is also prone to malicious exploitation, leading the model to generate harmful content. Current methods for enhancing the safety of the output content of instruction-tuned models still have shortcomings: safety tuning can undermine the model's helpfulness and lacks robust defense against jailbreak attacks, while using pre-trained content moderation models for content filtering can slow down the model's response speed. In response to these challenges, this paper introduces a novel fine-tuning approach known as Safety Classification Fine-tuning (SCFT). The motivation for SCFT is the observation that instruction-tuned models are vulnerable to misuse due to their inability to assess the safety of "instruction-response" pairs. The embedding vector of the sentence-ending EOS token in the hidden states output by the model's final decoder layer contains the semantic information of the entire sentence and is therefore well suited to judging the safety of sentences. However, the fundamental structure of the model means that it has no classification capability. To address this, we add a new classification head to the model's output layer. This head is trained to classify sentences as "safe" or "unsafe" based on their semantic information, while the model is simultaneously instruction-tuned for general capabilities. The trained classification head acts as an internal "discrimination mechanism" controlling the safety of the model's output, which allows the fine-tuned model to actively judge the safety of "instruction-response" pairs during inference and to prevent the output of unsafe content. Further analysis reveals that, with the "discrimination mechanism", SCFT can unify the training objectives of the model's utility and safety, achieving a better balance between the two. It also maintains the symmetry of knowledge between the model's general capabilities and safety capabilities, and expands the safety training data to the pre-training data distribution, enhancing the robustness of the model's safety capabilities. Experimental results demonstrate that SCFT is a resource-efficient, end-to-end safety-tuning method. It reduces the harmfulness rate of the fine-tuned model by approximately 91%, lowers the average harmfulness score from over 4 points to 1.36 (on a scale of 5, where a higher score indicates a more harmful model), and achieves a 0% harmfulness rate under jailbreak attacks, all without requiring additional computing resources or compromising the model's general capabilities.
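The abstract further states that, at inference time, the trained "discrimination mechanism" judges each "instruction-response" pair and blocks unsafe output. The sketch below shows one hedged way such a gate could look, reusing the hypothetical SCFTModel wrapper sketched earlier; the function name guarded_generate, the refusal string, and the 0 = "safe" / 1 = "unsafe" label convention are illustrative assumptions, not details taken from the paper.

```python
# Illustrative inference-time gate built on the hypothetical SCFTModel wrapper above.
import torch

REFUSAL = "I cannot help with that request."


@torch.no_grad()
def guarded_generate(scft_model, tokenizer, instruction: str, max_new_tokens: int = 256) -> str:
    """Generate a response, then release it only if the safety head deems it safe."""
    prompt = tokenizer(instruction, return_tensors="pt")
    gen_ids = scft_model.lm.generate(**prompt, max_new_tokens=max_new_tokens)
    # Re-run the full "instruction-response" sequence to read the final-layer
    # hidden state at its last position (the EOS token when generation ends normally).
    out = scft_model.lm(input_ids=gen_ids, output_hidden_states=True)
    eos_state = out.hidden_states[-1][:, -1, :]
    pred = scft_model.safety_head(eos_state).argmax(dim=-1).item()
    if pred == 1:          # 1 = "unsafe" under the assumed labelling
        return REFUSAL     # block unsafe content instead of returning it
    return tokenizer.decode(gen_ids[0], skip_special_tokens=True)
```

Because the gate reuses hidden states the model already computes, this kind of check adds little overhead compared with calling an external content-moderation model, which is the efficiency argument the abstract makes.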
Key words: Large language models; Content safety; Instruction fine-tuning