Cite this article:
- Gong Runsen, Wang Kai, Zhang Yulin, Zhang Weizhe, Qiao Yanchen, Zhang Yuqing. A Survey of Adversarial Techniques Against Large Model Alignment[J]. Journal of Cyber Security, Accepted.
Abstract:
With the advent of large models such as ChatGPT, the security of AI-generated content has drawn increasing attention from researchers. To ensure that a model's final behavior is consistent with human values, alignment techniques play a crucial role during model deployment. Alignment techniques adjust pre-trained models through fine-tuning or other methods, aiming to improve their reasoning ability on specific tasks. Attacks on alignment security have attracted broad attention from academia and industry, yet a systematic review of alignment attack techniques against large models is still lacking. This paper first examines the security risks of deployed large models: it surveys the vulnerabilities that may arise throughout the deployment process and the existing alignment techniques, comprehensively reviews current alignment attack methods, analyzes alignment attacks against deployed large models and the threats they pose, and identifies the vulnerabilities and potential attack vectors within alignment techniques. Second, from the perspective of the security risks introduced by fine-tuning for downstream tasks, it analyzes how the fine-tuning process undermines the alignment security of large models and surveys behaviors during fine-tuning that may give rise to alignment vulnerabilities. Third, from the perspective of the multimodal evolution of large models, it introduces the architecture of multimodal large language models (MLLMs), summarizes and analyzes the fusion techniques between different modalities, and points out that the continuous nature of MLLM inputs makes attacks more covert. Finally, it offers an outlook on future directions for alignment attack techniques against large models. By examining the current state and potential risks of alignment attack techniques in depth, this survey aims to inspire new research ideas and directions in academia.
Keywords: Large Model Alignment; Fine-tuning Large Model; AI Security; Multimodal Large Language Model
DOI: |
Received: 2024-04-19; Revised: 2024-09-09
Funding: National Key R&D Program of China (2023YFB3106400, 2023QY1202); Key Program of the National Natural Science Foundation of China (U2336203, U1836210); Key R&D Program of Hainan Province (GHYF2022010); Beijing Natural Science Foundation (4242031); National Natural Science Foundation of China (No. 62102202); Major Key Project of Peng Cheng Laboratory (No. PCL2023A05)
|
A Survey of Adversarial Techniques Against Large Model Alignment |
Gong Runsen1, Wang Kai1, Zhang Yulin2, Zhang Weizhe3, Qiao Yanchen3, Zhang Yuqing1
|
(1. University of Chinese Academy of Sciences; 2. Xidian University; 3. Pengcheng Laboratory)
Abstract: |
With the advent of large models like ChatGPT, the security of AI-generated content has garnered increasing attention from researchers. To ensure that the final behavior of models aligns with human values, alignment techniques play a crucial role during model deployment. These techniques adjust different pre-trained models through fine-tuning or other methods to enhance their reasoning capabilities on specific tasks. Attacks on alignment security have attracted widespread attention from academia and industry, but a systematic review of alignment attack techniques against large models is still lacking. This paper begins by examining the security risks faced by aligned large models at the deployment stage. It investigates potential vulnerabilities throughout the deployment process and reviews existing alignment techniques. A comprehensive study of current alignment attack methods is conducted, including prompt injection attacks, adversarial attacks, privacy leakage attacks, and backdoor trigger attacks, and the analysis identifies security vulnerabilities and potential attack vectors within alignment methods. Secondly, from the perspective of the security risks posed by fine-tuning for downstream tasks, the paper analyzes how the fine-tuning process undermines the safety constraints of aligned large models. It investigates behaviors during fine-tuning that may introduce alignment vulnerabilities, providing a detailed analysis of the impact of fine-tuning on the security of models customized through secondary development. Thirdly, from the perspective of the multimodal evolution of large models, the paper introduces the architecture of multimodal large language models (MLLMs). It summarizes and analyzes the fusion techniques between different modalities within these models and highlights the attack stealthiness that arises from the continuous nature of MLLM inputs. Finally, the paper provides an outlook on future directions for alignment attack techniques against large models. By deeply exploring the current state and potential risks of alignment attack techniques, this survey aims to inspire new research ideas and directions in academia.
Key words: Large Model Alignment; Fine-tuning Large Model; AI Security; Multimodal Large Language Model