Research on Malicious Web Page Identification Method Based on Deep Learning and Feature Fusion

YANG Shengjie; CHEN Zhaoyang; XU Yi; LIU Jiangang

Cite this article:

YANG Shengjie, CHEN Zhaoyang, XU Yi, LIU Jiangang. Research on Malicious Web Page Identification Method Based on Deep Learning and Feature Fusion[J]. Journal of Cyber Security, 2024, 9(3): 176-190

YANG Shengjie¹, CHEN Zhaoyang¹, XU Yi¹, LIU Jiangang²
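(1. School of Computer Science, Hunan University of Technology and Business, Changsha 410205, China; 2. School of Science, Hunan University of Technology and Business, Changsha 410205, China)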

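Abstract:

The openness and disorder of the Internet make network security problems both widespread and unpredictable, and network security has become a focus of international attention. Machine-learning-based methods for identifying malicious web pages have achieved a great deal, but as the demands on malicious web page identification keep rising, they still show considerable limitations in identification efficiency. This paper proposes an identification method based on deep learning and feature fusion that combines a graph convolutional network (GCN) with a one-dimensional convolutional neural network (CNN) and a support vector machine (SVM). First, traditional neural networks are suited only to structured data and do not capture non-contiguous and long-distance dependencies between words well, which lowers identification accuracy; the rich relational structure of the GCN is therefore used to capture and preserve the global information of the web page text. Second, because a CNN can compensate for the GCN's weakness in extracting local feature information, a one-dimensional CNN extracts local information from the page's uniform resource locator (URL), and the captured local URL features are fused with the global text features, yielding more representative web page features that reflect the strengths of both the CNN and GCN models. Finally, the fused features are fed into an SVM classifier to classify the page. This paper applies GCN to malicious web page identification for the first time. The combined model draws on the advantages of both deep learning and machine learning, using the deep learning networks as feature extractors and the machine learning classification algorithm as the classifier. Experiments show a test accuracy of 92.5%, higher than that of existing shallow machine learning detection methods and of single neural network models. The proposed method is also more stable and performs better on several detection metrics, including precision, recall, and F1 score.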
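The abstract does not state the GCN layer rule, the fusion operator, or the SVM kernel. As a hedged illustration only, a commonly used GCN propagation rule (Kipf and Welling), a concatenation-based fusion, and a linear SVM decision function can be written as below. All symbols are assumptions introduced here, not notation from the paper: $A$ is the adjacency matrix of the text graph, $H^{(l)}$ the node features at layer $l$, $W^{(l)}$ a weight matrix, $\sigma$ a nonlinearity, $h_{\mathrm{GCN}}$ the global text feature, $h_{\mathrm{CNN}}$ the local URL feature, and $z$ the fused feature.

```latex
\begin{aligned}
H^{(l+1)} &= \sigma\!\left(\tilde{D}^{-1/2}\,\tilde{A}\,\tilde{D}^{-1/2}\,H^{(l)}\,W^{(l)}\right),
  \qquad \tilde{A} = A + I,\quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij},\\
z &= \left[\,h_{\mathrm{GCN}} \,;\, h_{\mathrm{CNN}}\,\right],\\
\hat{y} &= \operatorname{sign}\!\left(w^{\top} z + b\right).
\end{aligned}
```

Under these assumptions, $\hat{y} = +1$ would label a page as malicious and $\hat{y} = -1$ as benign; the sign convention is arbitrary and is chosen here only for illustration.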

Keywords: malicious web page; machine learning; deep learning; feature fusion

DOI: 10.19363/J.cnki.cn10-1380/tn.2024.05.12

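Received: 2022-10-07; Revised: 2023-01-06

Funding: This work was supported by the Scientific Research Project of the Hunan Provincial Department of Education (No. 21A0385, No. 22B0612) and the General Program of the Natural Science Foundation of Hunan Province (No. 2022JJ30214).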

Research on Malicious Web Page Identification Method Based on Deep Learning and Feature Fusion

YANG Shengjie¹, CHEN Zhaoyang¹, XU Yi¹, LIU Jiangang²

(1. School of Computer Science, Hunan University of Technology and Business, Changsha 410205, China; 2. School of Science, Hunan University of Technology and Business, Changsha 410205, China)

Abstract:

The high degree of openness and disorder of the Internet environment has made network security issues pervasive and unpredictable, and they have become a major concern of the international community. Although machine-learning-based methods for identifying malicious web pages have made great progress, they still show clear limitations in identification efficiency as the demand for such identification continues to grow. In this paper, an identification method based on deep learning and feature fusion is proposed, which combines a graph convolutional network (GCN) with a one-dimensional convolutional neural network (CNN) and a support vector machine (SVM). First, traditional neural networks are only suitable for processing structured data and cannot capture non-contiguous and long-distance dependencies between words well, which reduces the accuracy of web page identification; to address this, the rich relational structure of GCN is used to capture and maintain the global context of web page text. Second, since CNN can make up for the deficiency of GCN in extracting local feature information, a one-dimensional CNN extracts local information from the URL of the web page, and the captured local URL features are then fused with the global features of the web page text, so as to obtain more representative web page features that take into account the characteristics of both the CNN model and the GCN model. Finally, the fused features are input into the SVM classifier for web page discrimination. In this paper, GCN is applied to the field of malicious web page identification for the first time, and the combined model effectively draws on the advantages of both deep learning and machine learning: the deep learning network models are used as feature extractors, and the machine learning classification algorithm is used as the classifier. Experiments show that the test accuracy reaches 92.5%, which is higher than that of existing shallow machine learning detection methods and single neural network models. The proposed method also has higher stability and shows superior performance on multiple detection metrics such as precision, recall, and F1 score.
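The feature-extraction and fusion steps described in the abstract can be sketched roughly as follows. This is a minimal, dependency-free illustration, not the authors' implementation: the single GCN layer with mean pooling, the character-code URL encoding, the fixed toy filters, and fusion by concatenation are all assumptions made here, and every function name is hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a dense adjacency matrix (list of lists)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # degrees including the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def gcn_global_feature(adj, node_feats, weight):
    """One GCN layer, then mean pooling over nodes (assumed readout) -> one vector per page."""
    hidden = relu(matmul(matmul(normalize_adjacency(adj), node_feats), weight))
    return [sum(col) / len(hidden) for col in zip(*hidden)]

def cnn_local_feature(url, filters, width=3):
    """1D convolution over scaled character codes, ReLU, then max-over-time pooling."""
    xs = [ord(c) / 128.0 for c in url]  # toy encoding; a real model would learn embeddings
    windows = [xs[i:i + width] for i in range(len(xs) - width + 1)]
    feats = []
    for f in filters:
        acts = [max(0.0, sum(w * x for w, x in zip(f, win))) for win in windows]
        feats.append(max(acts, default=0.0))  # URLs shorter than the filter yield 0.0
    return feats

def fuse(global_feat, local_feat):
    """Fusion by concatenation (an assumption; the abstract does not name the operator)."""
    return global_feat + local_feat

# Toy page: a three-word text graph (path graph) and a made-up URL.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot word features
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                      # untrained toy weights
filters = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]                      # untrained toy filters

fused = fuse(gcn_global_feature(adj, node_feats, weight),
             cnn_local_feature("http://example.test/login", filters))
print(fused)  # a 4-dimensional fused feature vector for this page
```

In a full pipeline, the weights and filters would be learned, and the fused vectors for a labeled set of pages would be passed to an SVM implementation (for example, scikit-learn's `sklearn.svm.SVC`) for the final malicious/benign decision.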

Key words: malicious web page; machine learning; deep learning; feature fusion