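Abstract: |
The widespread adoption of the Internet of Things and big data technology has made daily life far more convenient and has also generated large volumes of high-dimensional data. Analyzing published high-dimensional data uncovers implicit value and knowledge that can guide the decisions of governments, enterprises, and institutions. However, because high-dimensional data usually contain sensitive personal information, publishing the raw data directly poses a serious threat to personal privacy. Differential privacy is a privacy protection framework with a rigorous formal definition that supports data publishing and analysis without disclosing sensitive personal information. Existing differentially private methods for publishing high-dimensional data, however, fail to fully capture the relationships among the data during dimensionality reduction and define the data distribution model inaccurately. To address these problems, this paper proposes a differentially private high-dimensional data publishing method based on a Gaussian generative model. First, the maximal information coefficient and Dvoretzky's theorem are used to preprocess the high-dimensional data, filtering out sparse attributes that are useless or have missing values and reducing the impact of the additional perturbation error introduced by data sparsity on the level of privacy protection. Next, the preprocessed data are projected so that the projection of the high-dimensional data onto a low-dimensional space is close to a Gaussian distribution. Finally, the projected data are used to train a differentially private Gaussian generative model, and the synthetic data produced by the model are published in place of the original high-dimensional data. By designing a preprocessing method suited to high-dimensional data, the approach improves differentially private high-dimensional data publishing based on a Gaussian generative model; while retaining multiple functional relationships in the original data, it addresses the low utility of published high-dimensional data caused by an unknown data distribution or an inaccurate model definition. Theoretical analysis and experimental results also show that the proposed algorithm achieves better utility than comparable algorithms. |
Keywords: Differential privacy; High-dimensional data; Data publishing; Gaussian generative model; Maximal information coefficient |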
DOI:10.19363/J.cnki.cn10-1380/tn.2023.09.16 |
Received: 2021-09-24 Revised: 2021-12-02 |
Funding: National Natural Science Foundation of China (General Program, Key Program, Major Program) |
|
High-dimensional Data Publishing with Differential Privacy Protection |
shenbo, zhangrui
|
(Institute of Information Engineering, CAS) |
Abstract: |
The spread of IoT and big data technology has greatly facilitated people's lives and has also produced large volumes of high-dimensional data. Analyzing published high-dimensional data reveals implicit value and knowledge that can guide the decisions of governments, enterprises, and institutions. However, because high-dimensional data often contain sensitive personal information, publishing the raw data directly poses a serious threat to personal privacy. Differential privacy is a privacy protection framework with a rigorous formal definition that supports data publishing and analysis without revealing sensitive personal information. Existing differentially private methods for publishing high-dimensional data, however, fail to fully capture the relationships among the data during dimensionality reduction and define the data distribution model inaccurately. To address these problems, this paper proposes a differentially private high-dimensional data publishing method based on a Gaussian generative model. First, we use the maximal information coefficient and Dvoretzky's theorem to preprocess the high-dimensional data, filtering out sparse attributes that are useless or have missing values, which reduces the impact of the additional perturbation error introduced by data sparsity on the level of privacy protection. Next, the preprocessed data are projected so that the projection of the high-dimensional data onto a low-dimensional space is approximately Gaussian. Finally, the projected data are used to train a differentially private Gaussian generative model, and the synthetic data generated by the model are published in place of the original high-dimensional data.
By designing a preprocessing method suited to high-dimensional data, the approach improves differentially private high-dimensional data publishing based on a Gaussian generative model. While retaining multiple functional relationships in the original data, it addresses the low utility of published high-dimensional data caused by an unknown data distribution or an inaccurate model definition. Theoretical analysis and experimental results show that the proposed algorithm achieves better utility than comparable algorithms. |
Key words: Differential privacy; High-dimensional data; Data publishing; Gaussian generative model; Maximal information coefficient |
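
The projection and generative-model steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract gives no implementation details, so the function name `dp_gaussian_synthesizer`, the unit-norm clipping, the even split of the privacy budget, the use of the classical Gaussian mechanism, and the mapping back to the original space via the transpose of the projection are all assumptions made for illustration. The preprocessing based on the maximal information coefficient is omitted.

```python
import numpy as np


def dp_gaussian_synthesizer(X, k, epsilon, delta, n_samples, rng):
    """Hypothetical sketch: project X to k dims, fit a noisy Gaussian, sample.

    Assumes replace-one neighbouring datasets (n is public) and the classical
    Gaussian mechanism, which needs each per-query epsilon to be below 1.
    """
    if not (0 < epsilon < 2 and 0 < delta < 1):
        raise ValueError("need 0 < epsilon < 2 and 0 < delta < 1")
    n, d = X.shape

    # Clip every record to L2 norm <= 1 so the sensitivities below are bounded.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.maximum(norms, 1.0)

    # Data-independent random projection with orthonormal columns (d x k,
    # k <= d); projected records still have norm <= 1.
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    Z = X @ Q

    # Assumed budget split: half for the mean, half for the second moment.
    eps, dlt = epsilon / 2, delta / 2
    # Both statistics have L2 sensitivity 2/n under the clipping above.
    sigma = (2.0 / n) * np.sqrt(2.0 * np.log(1.25 / dlt)) / eps

    # Noisy mean.
    mu = Z.mean(axis=0) + rng.normal(0.0, sigma, size=k)

    # Noisy second moment: draw noise on the upper triangle and mirror it
    # so the perturbed matrix stays symmetric.
    noise = np.triu(rng.normal(0.0, sigma, size=(k, k)))
    noise = noise + np.triu(noise, 1).T
    M = Z.T @ Z / n + noise

    # Post-processing: covariance estimate, clipped to be positive semidefinite.
    cov = M - np.outer(mu, mu)
    w, V = np.linalg.eigh(cov)
    w = np.clip(w, 0.0, None)

    # Sample from N(mu, cov) in the low-dimensional space.
    Z_syn = mu + (rng.standard_normal((n_samples, k)) * np.sqrt(w)) @ V.T

    # Assumed way to return to the original attribute space.
    return Z_syn @ Q.T
```

A call such as `dp_gaussian_synthesizer(X, k=5, epsilon=1.0, delta=1e-5, n_samples=100, rng=np.random.default_rng(0))` returns an array with 100 synthetic rows and as many columns as `X`. Because the noise is added only to the mean and second moment, everything after that point is post-processing and consumes no further privacy budget.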