Privacy Policy Compliance Techniques Based on Automated Text Classification

Niu Ben; Li Bohao; Tang Peng; Sun Xiongtao; Hou Yuqiao; Li Fenghua

Cite this article:

Niu Ben, Li Bohao, Tang Peng, Sun Xiongtao, Hou Yuqiao, Li Fenghua. Privacy Policy Compliance Techniques Based on Automated Text Classification[J]. Journal of Cyber Security, Accepted.

Niu Ben¹, Li Bohao¹, Tang Peng², Sun Xiongtao³, Hou Yuqiao¹, Li Fenghua¹
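(1. Institute of Information Engineering, Chinese Academy of Sciences; 2. School of Cyberspace Security, Shanghai Jiao Tong University; 3. School of Cyber Engineering, Xidian University)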

Abstract:

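Privacy policies typically contain extensive legal terminology and complex sentence structures, which makes them difficult for ordinary users to understand accurately. In addition, many privacy policies violate relevant laws and regulations, whether intentionally or unintentionally, so ensuring their compliance is a major challenge. Existing privacy policy information extraction methods, however, fall short on readability: they are not very effective, lack domain expertise, and are complex. On compliance, existing approaches suffer from insufficient coverage, poor detection performance, and difficulty of implementation. This paper therefore focuses on compliance detection for privacy policies, aiming to improve both their readability and their compliance from a legal perspective. Starting from OPP-115, the most authoritative privacy policy dataset, we build RoBERTa-PrivCapsNet, a model for automated classification of privacy policy text. The model effectively extracts and classifies privacy policy information and performs well, outperforming existing state-of-the-art approaches on some metrics. We then analyse in depth the relationship between the General Data Protection Regulation (GDPR) and the OPP-115 dataset, and distil seven privacy policy compliance principles that cover the core requirements of the GDPR. Building on this, we combine the proposed classification model with these principles to design a comprehensive privacy policy compliance detection method that automates GDPR-oriented compliance detection. After running compliance detection on 414 privacy policies and manually verifying the results, we find that the method achieves an average accuracy of 89.1% and that 55% of the tested policies are non-compliant. The results show that the proposed method significantly extends existing compliance coverage and detects privacy policy compliance more effectively and precisely.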

Keywords: privacy policy; GDPR; text classification; privacy policy compliance principles; compliance detection

DOI：

Submitted: 2024-06-11; Revised: 2024-11-06

Funding:

Privacy Policy Compliance Techniques Based on Automated Text Classification

Niu Ben¹, Li Bohao¹, Tang Peng², Sun Xiongtao³, Hou Yuqiao¹, Li Fenghua¹

(1. Institute of Information Engineering, Chinese Academy of Sciences; 2. School of Cyberspace Security, Shanghai Jiao Tong University; 3. School of Cyber Engineering, Xidian University)

Abstract:

Privacy policies typically contain many legal terms and complex sentence structures, which makes them difficult for ordinary users to understand correctly. In addition, many privacy policies intentionally or unintentionally violate relevant laws and regulations, making compliance a major challenge. However, existing privacy policy information extraction methods are ineffective, insufficiently specialised, and complex when addressing readability. For privacy policy compliance, existing approaches suffer from insufficient coverage, poor detection performance, and difficulty of implementation. Therefore, this paper focuses on compliance detection for privacy policies, aiming to improve the readability and compliance of privacy policies at the legal level. Starting from the most authoritative OPP-115 privacy policy dataset, we construct the RoBERTa-PrivCapsNet model for automated classification of privacy policy text, which effectively performs privacy policy information extraction and classification. The model shows excellent performance, and on some metrics it outperforms existing state-of-the-art schemes. We then analyse in depth the association between the General Data Protection Regulation (GDPR) and the OPP-115 dataset, and extract seven privacy policy compliance principles that cover the core requirements of the GDPR. On this basis, we combine the proposed automated classification model with the summarised compliance principles to design a comprehensive privacy policy compliance detection method, which achieves automated GDPR-oriented compliance detection. After compliance detection and manual verification of 414 privacy policies, we conclude that the average accuracy of the designed method reaches 89.1% and that the non-compliance rate of the tested privacy policies is 55%. The results show that the proposed method significantly extends existing compliance coverage and achieves privacy policy compliance detection more effectively and accurately.

Key words: privacy policy; GDPR; text classification; privacy policy compliance principles; compliance detection
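
As a rough illustration of how a text classifier and a set of compliance principles could be combined, the sketch below checks whether the OPP-115 categories predicted for a policy's segments cover the categories each principle requires. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not list the seven principles, so the principle names and their mapping to categories in `HYPOTHETICAL_PRINCIPLES` are invented placeholders (only four are shown), and the per-segment labels are assumed to come from some classifier such as RoBERTa-PrivCapsNet. Only the category names themselves are taken from the OPP-115 annotation scheme.

```python
from typing import Dict, List, Set

# HYPOTHETICAL mapping: principle name -> OPP-115 categories that must appear.
# These names and requirements are placeholders for illustration only; they
# are not the seven principles derived in the paper.
HYPOTHETICAL_PRINCIPLES: Dict[str, Set[str]] = {
    "data_collection_disclosed": {"First Party Collection/Use"},
    "third_party_sharing_disclosed": {"Third Party Sharing/Collection"},
    "user_rights_described": {"User Access, Edit and Deletion"},
    "retention_stated": {"Data Retention"},
}


def check_policy(segment_labels: List[Set[str]]) -> Dict[str, bool]:
    """For each principle, report whether all of its required categories
    appear in at least one segment of the policy."""
    # Union of every category predicted anywhere in the policy.
    present: Set[str] = set().union(*segment_labels)
    return {
        name: required <= present
        for name, required in HYPOTHETICAL_PRINCIPLES.items()
    }


def is_compliant(segment_labels: List[Set[str]]) -> bool:
    """A policy passes only if every placeholder principle is satisfied."""
    return all(check_policy(segment_labels).values())


if __name__ == "__main__":
    # Stand-in for classifier output: one set of categories per segment.
    labels = [
        {"First Party Collection/Use"},
        {"Third Party Sharing/Collection", "Data Retention"},
    ]
    for principle, ok in check_policy(labels).items():
        print(f"{principle}: {'ok' if ok else 'MISSING'}")
    print("compliant:", is_compliant(labels))
```

A presence check of this kind is the simplest possible rule; a real detector would likely also need to inspect attribute values within each category rather than category presence alone.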