A Survey on Image Visualization Approaches-based Malware Classification Techniques

QIAN Liping; WANG Dawei

Cite this article:

QIAN Liping, WANG Dawei. A Survey on Image Visualization Approaches-based Malware Classification Techniques[J]. Journal of Cyber Security, 2024, 9(5): 139-161

QIAN Liping¹, WANG Dawei²
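(1. College of Electrical & Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China; 2. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China)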

Abstract:

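Malware development increasingly shows trends toward large scale, family-based organization, and automation, and malware authors commonly use encryption and obfuscation to evade detection. This drives rapid growth in the number of malware samples, while the features implicit in those samples make detection by deep learning potentially feasible. As a result, mainstream detection and classification methods have shifted from matching hand-crafted features to automatic feature mining with machine learning and deep learning. The performance of a malware classification model often depends on the quality and effectiveness of the classification features that human experts extract. Mapping malware to images helps alleviate the shortage of expert knowledge that manual feature engineering faces, and it allows state-of-the-art results from image classification to be borrowed naturally, so malware classification based on image visualization has become an important research direction. This survey summarizes malware classification techniques based on image visualization. It focuses on how malware images are generated, including image size, the choice of grayscale or color channels, pixel position mapping, and pixel value computation, and it compares different image representations, feature extraction methods, and their classification performance. The experimental results reported in the literature show that all of these factors affect malware classification performance. The survey then summarizes the advantages of image-visualization-based malware classification and identifies the main problems and challenges it faces. The advantages include reduced dependence on expert knowledge, better suitability for detecting malware variants, and the ability to draw on a body of research results from image processing. The problems and challenges include the difficulty of locating malicious payloads, model applicability, interpretability of results, and the scarcity of high-quality labeled data. In response to these problems and challenges, the survey outlines several feasible and important directions for future research: cognition-based deep learning models, knowledge graphs for the malware domain, adversarial augmentation of malware samples, and evaluation benchmarks with high-quality datasets.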

Keywords: malware classification; malware visualization; feature engineering; x-plot; dataset; data augmentation

DOI: 10.19363/J.cnki.cn10-1380/tn.2024.09.07

Received: 2023-07-09; Revised: 2023-12-07

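Funding: This work was supported by the National Key Research and Development Program of China (No. 2022YFC3321101, No. 2022YFB3103705) and the National Natural Science Foundation of China (No. 61571144).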

A Survey on Image Visualization Approaches-based Malware Classification Techniques

QIAN Liping¹, WANG Dawei²

(1. College of Electrical & Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China; 2. National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing 100029, China)

Abstract:

Malware authors now develop malware at scale, in families, and with automated tooling, and they generally use code encryption and obfuscation to evade malware detection systems. The consequences are mixed: the number of malware samples grows rapidly, yet the implicit features shared by malware variants can potentially be mined with deep learning. The mainstream approach to malware detection and classification has therefore shifted from matching hand-crafted features to automated feature mining with machine learning and deep learning. The performance of malware classification models often depends on the quality and efficiency of manual feature engineering. Mapping malware into images not only helps alleviate the lack of professional knowledge in manual feature engineering but also naturally borrows achievements from the field of image processing, so image-visualization-based malware classification has become an attractive research direction. This survey summarizes research progress in image-visualization-based classification of malicious executables and focuses on methods for generating an image from a malware binary file, including image size setting, the choice of grayscale or color channels, pixel coordinate projection, and pixel value computation. It also gives a comparative analysis of feature representation and extraction techniques for visualized malware. Results from the surveyed literature show that all of these factors affect malware classification performance.
We also summarize the advantages of the approach and the main difficulties and challenges it faces. The advantages include reduced reliance on expert knowledge, better applicability to detecting malware variants, and the ability to draw on a series of research achievements in image processing. The difficulties and challenges include locating malicious payloads, model applicability, interpretability, and the lack of high-quality labeled data. We then present several directions for future research: cognitive deep learning (DL) models, malware knowledge graphs, adversarial malware data augmentation, and evaluation benchmarks with high-quality datasets.

Key words: malware classification; malware visualization; feature engineering; x-plot; dataset; data augmentation
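
Illustrative sketch: the abstract names four choices made when generating an image from a malware binary (image size, gray or color channels, pixel position mapping, and pixel value computation). The Python sketch below shows one simple combination of those choices, assumed here purely for illustration and not taken from the survey: a fixed image width, a single grayscale channel, row-major pixel placement, and each byte value used directly as the pixel intensity. The function names `bytes_to_gray_image` and `to_pgm`, the zero padding of the last row, and the sample bytes are all hypothetical.

```python
from typing import List


def bytes_to_gray_image(data: bytes, width: int) -> List[List[int]]:
    """Map a byte string to a 2-D grayscale image, returned as a list of rows."""
    if width <= 0:
        raise ValueError("width must be positive")
    # Pixel value computation (assumed scheme): each byte, 0-255, is the intensity.
    pixels = list(data)
    # Image size: the width is fixed, the height follows from the data length.
    # The last row is padded with zeros (an assumption of this sketch).
    remainder = len(pixels) % width
    if remainder:
        pixels.extend([0] * (width - remainder))
    # Pixel position mapping: row-major, byte i lands at (i // width, i % width).
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]


def to_pgm(image: List[List[int]]) -> bytes:
    """Serialize the image as a binary PGM (P5) file body, one gray channel."""
    height = len(image)
    width = len(image[0]) if image else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(value for row in image for value in row)


if __name__ == "__main__":
    # Hypothetical sample bytes; a real input would be the contents of a binary file.
    sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00])
    print(bytes_to_gray_image(sample, width=4))
    # prints [[77, 90, 144, 0], [3, 0, 0, 0]]
```

Changing `width`, the padding rule, or the byte-to-intensity mapping yields a different image for the same file, which is the kind of design variation the abstract reports as affecting classification performance.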