Malware Detection and Classification Based on API Block Reconstruction and Image Representation

YANG Hongyu; ZHANG Yupei; ZHANG Liang; CHENG Xiang

Cite this article:

YANG Hongyu, ZHANG Yupei, ZHANG Liang, CHENG Xiang. Malware Detection and Classification Based on API Block Reconstruction and Image Representation[J]. Journal of Cyber Security, 2024, 9(5): 110-126

YANG Hongyu^1,2, ZHANG Yupei², ZHANG Liang³, CHENG Xiang^4,5
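(1. School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China; 2. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China; 3. School of Information, University of Arizona, Tucson, AZ 85721, USA; 4. School of Information Engineering, Yangzhou University, Yangzhou 225127, China; 5. Jiangsu Engineering Research Center for Knowledge Management and Intelligent Service, Yangzhou 225127, China)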

Abstract:

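To address the challenges that current malware detection and classification methods face in feature extraction and detection accuracy, this paper proposes a malware detection and classification method based on API block reconstruction and image representation. First, the API categories called by malware are numbered uniformly, and APIs with the same code in the API instruction sequence are aggregated into the same API block. The API blocks are reordered according to the order in which each API category is first called at runtime, and the number of entries in each API block is recorded as that API category's devotion (contribution) to the software sample. After block reconstruction, the API blocks are arranged in the order in which the software sample calls each API category, and the entries inside each block follow the order in which the sample calls that individual API. This ordered API grouping facilitates the image representation of the API instruction sequence information. From the reconstructed API instruction sequence, the API codes are extracted as the global feature list, the API devotions as the local feature list, and the API sequential indexes as the temporal feature list; the feature lists are normalized and zero-padded to convert them into feature arrays of uniform size. The API code clearly identifies the API category, the API devotion characterizes how frequently the API is called, and the API sequential index distinguishes the order in which the APIs are called. Then, the three types of feature arrays fill the three channels of an RGB image, generating a three-channel feature image of API code, devotion, and sequential index (FimgCDS). Finally, the FimgCDS feature image is fed into a self-built lightweight malware feature image convolutional neural network (MficNN) classifier to detect and classify malware. Experimental results show that the method achieves detection and classification accuracies of 98.66% and 98.35% on two datasets, with strong detection and classification performance metrics and speed.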
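The block reconstruction step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the trace format, the function name `reconstruct_api_blocks`, the example API names, and the integer codes are all assumptions made for the example. It groups calls by API code, orders the groups by first invocation, and records each group's size as its devotion.

```python
def reconstruct_api_blocks(api_calls, api_codes):
    """Group a call trace into per-API blocks ordered by first invocation.

    api_calls: list of API names in call order (hypothetical trace format).
    api_codes: dict mapping API name -> integer code (hypothetical numbering).
    Returns a list of (code, devotion, call_positions) tuples.
    """
    blocks = {}  # dicts preserve insertion order, i.e. first-invocation order
    for position, name in enumerate(api_calls):
        code = api_codes[name]
        # Positions inside a block stay in the order the sample made the calls.
        blocks.setdefault(code, []).append(position)
    # Devotion is assumed here to be the number of entries in the block.
    return [(code, len(positions), positions) for code, positions in blocks.items()]


# Illustrative trace; names and codes are made up for the example.
trace = ["NtOpenFile", "NtReadFile", "NtOpenFile", "NtClose", "NtReadFile", "NtOpenFile"]
codes = {"NtOpenFile": 1, "NtReadFile": 2, "NtClose": 3}
blocks = reconstruct_api_blocks(trace, codes)
for code, devotion, positions in blocks:
    print(code, devotion, positions)
```

For this trace the blocks come out as code 1 with devotion 3, code 2 with devotion 2, and code 3 with devotion 1, in that order, because that is the order in which each API is first called.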

Keywords: malware classification; API; feature extraction; image representation; RGB image; convolutional neural network

DOI: 10.19363/J.cnki.cn10-1380/tn.2024.09.05

Received: 2022-08-30; Revised: 2022-12-05

Funding: This work was supported by the National Natural Science Foundation of China (No. U1833107).

Malware Detection and Classification Based on API Block Reconstruction and Image Representation

YANG Hongyu^1,2, ZHANG Yupei², ZHANG Liang³, CHENG Xiang^4,5

(1. School of Safety Science and Engineering, Civil Aviation University of China, Tianjin 300300, China; 2. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China; 3. School of Information, University of Arizona, Tucson, AZ 85721, USA; 4. School of Information Engineering, Yangzhou University, Yangzhou 225127, China; 5. Jiangsu Engineering Research Center for Knowledge Management and Intelligent Service, Yangzhou 225127, China)

Abstract:

To address the challenges that current malware detection and classification methods face in feature extraction and detection accuracy, a malware detection and classification method based on API block reconstruction and image representation is proposed. First, the API categories invoked by malware at runtime are numbered uniformly, and APIs with the same code are aggregated into the same API block. The API blocks are then reordered according to the order in which each API category is first invoked, and the number of entries in each API block is recorded as the devotion of that API category. After reconstruction, the API blocks are arranged in the order in which the software sample calls each type of API, and the order within each API block is the order in which the sample calls the individual API. The ordered API block sequence helps to represent the API instruction sequence information as an image. From the reconstructed sequence, the API codes are extracted as the global feature list, the API devotions as the local feature list, and the API sequential indexes as the temporal feature list; the feature lists are normalized and zero-padded to form feature arrays of uniform size. The API code clearly identifies the API category, the API devotion characterizes how frequently the API is called, and the API sequential index distinguishes the order in which each API is called. Then, the three channels of an RGB image are filled with the three types of feature arrays to generate the feature image of API code, devotion, and sequential index (FimgCDS). Finally, the FimgCDS feature image is fed into a self-built lightweight malware feature image convolutional neural network (MficNN) classifier for malware detection and classification. The experimental results show that the method achieves detection and classification accuracies of 98.66% and 98.35% on the two datasets, with high detection and classification performance and speed.
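The construction of a three-channel feature image from the three feature lists might look like the sketch below. The abstract does not specify the normalization formula, the image size, or the exact definition of the sequential index, so this example assumes scaling each list by its maximum to the 0..255 range, a caller-chosen square side length, and a 1-based block rank as the sequential index. The helper names `to_channel` and `build_feature_image` are hypothetical.

```python
def to_channel(values, side):
    """Scale values to 0..255 by their maximum, zero-pad, and reshape to side x side."""
    peak = max(values) if values and max(values) > 0 else 1
    # Assumption: lists longer than the image are truncated.
    scaled = [round(255 * v / peak) for v in values[: side * side]]
    padded = scaled + [0] * (side * side - len(scaled))
    return [padded[row * side:(row + 1) * side] for row in range(side)]


def build_feature_image(blocks, side):
    """Build a side x side grid of (R, G, B) pixels.

    blocks: list of (api_code, devotion) pairs in first-invocation order.
    R = API code, G = devotion, B = sequential index (assumed channel mapping).
    """
    codes = [code for code, _ in blocks]
    devotions = [devotion for _, devotion in blocks]
    # 1-based rank so that real entries differ from zero padding.
    order = [rank + 1 for rank in range(len(blocks))]
    r, g, b = (to_channel(channel, side) for channel in (codes, devotions, order))
    return [[(r[y][x], g[y][x], b[y][x]) for x in range(side)] for y in range(side)]


# Three blocks packed into a 2 x 2 image; the last pixel is zero padding.
image = build_feature_image([(1, 3), (2, 2), (3, 1)], side=2)
for row in image:
    print(row)
```

With these inputs the first pixel is (85, 255, 85) and the padded pixel is (0, 0, 0). A real pipeline would pick the image size from the longest sample and pass the resulting array to the classifier.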

Key words: malware classification; application programming interface; feature extraction; image representation; RGB image; convolutional neural network