A Comparative Study of Data Perturbation Using Fuzzy Logic to Preserve Privacy

August 8, 2017 | Author: Thanveer Jahan | Category: Information Security, Machine Learning, Data Mining, Computer Security


Chapter 13

Thanveer Jahan, G. Narasimha, and C.V. Guru Rao

Abstract The latest advances in information technology have brought enormous growth in data collection. Individuals' data, which contain sensitive information, are shared for business or legal reasons. Sharing data is mutually beneficial for business growth, but the need to preserve privacy has become a challenging problem in privacy preserving data mining. In this paper we deal with a data analysis system that holds sensitive information. Exposing an individual's information leads to security threats and could be harmful. The confidential attributes are therefore perturbed, or distorted, using fuzzy logic, which protects an individual's data by hiding its details from the public. The data is owned by an authorized user, who applies the distortion. The authorized user holds the original dataset and distorts its numeric data using an S-shaped fuzzy membership function. The distorted data is published to the analyst, hiding the sensitive information present in the original data. The analyst applies data mining techniques to the distorted dataset. The accuracy of classification and clustering on the distorted data is comparable to that on the original data; thus privacy is achieved. A comparison of various classifiers on the original and distorted datasets is presented.

13.1 Introduction

In the process of data publishing, large volumes of personal data are collected. With advances in technology and global networking, database sharing has become a common phenomenon. The shared data may be criminal records, credit records or patient records released by a hospital. Such data is sensitive to privacy issues in areas such as defense applications, financial transactions, healthcare records and network communication traffic [1]. Researchers and data analysts analyze these data using data mining techniques. Data mining is the process of gathering and collecting data to extract information. Analyzing such raw data can threaten privacy. Data containing sensitive or confidential information is protected using privacy preserving data mining. Many approaches have been employed to preserve privacy: randomization, anonymization and secure multiparty computation. The randomization method consists of data perturbation or data modification, which perturbs the confidential attributes. Classes of methods have been proposed for privacy protection in the data processing used in analysis systems. Data perturbation methods modify data or add noise to it [2]; data mining techniques have shown that the original and perturbed data are relatively the same, with accuracy measured by different classifiers. The dimensionality of the matrix is reduced by transforming the original dimensions of the data. Wang et al. [3] pointed out the significance of feature selection for analysis and suggested that performing SSVD together with feature selection is a better approach for classification, while discarding features with small distorted values. Various other methods are adopted for preserving privacy. In data swapping [4, 5], attributes are interchanged with a high probability. In aggregation [6], a row is represented as a group of values. Fourier and signal transformation methods [7, 8] are fast, improving time complexity. In data anonymization, approaches such as generalization and suppression are used; k-anonymity protects against identity disclosure but not attribute disclosure.

In secure multiparty computation (SMC) [9], data is encrypted using protocols such as secure sum and secure union, without revealing private data to the data miners. B. Karthikeyan et al. [10] applied a fuzzy membership function to the original data and showed improved efficiency, with fewer passes needed to perform clustering. In this paper we extend our work to an application where the information is imprecise and fuzzy logic provides a better solution [5]. Individual information is preserved, without revealing details in public, using fuzzy reasoning. The confidential attributes are modified using an S-based fuzzy membership function on horizontally distributed data, by performing the union of all individual entities. The distorted data is analyzed using data mining techniques such as classification. A number of the methods used to preserve privacy increase complexity and processing time. An optimum solution is achieved in this paper using a fuzzy based approach.

The rest of the paper is organized as follows: Sect. 13.1 is the literature survey, and Sect. 13.2 is the background work on privacy preserving data mining. Section 13.3 describes the fuzzy based approach used in privacy preserving. Sections 13.4 and 13.5 cover the classification and clustering used on the datasets. Section 13.6 describes the proposed method, the experimental results and the comparison between the classifiers and clustering on the original and distorted datasets. Finally, Sect. 13.7 concludes the proposed work and outlines future work.

T. Jahan (*) • G. Narasimha, Department of Computer Science and Engineering, JNTU, Hyderabad, AP, India; e-mail: [email protected]; [email protected]
C.V. G. Rao, Department of Computer Science and Engineering, S.R Engg, Warangal, AP, India; e-mail: [email protected]
N. Meghanathan et al. (eds.), Networks and Communications (NetCom2013), Lecture Notes in Electrical Engineering 284, DOI 10.1007/978-3-319-03692-2_13, © Springer International Publishing Switzerland 2014

13.2 Background and Related Work

13.2.1 Privacy Preserving Data Mining

The main aim of preserving privacy is to hide sensitive data while it is being published. Rising privacy concerns stem from the disclosure of information. Data can reside at a single organization or in different places, i.e. as distributed data. In such scenarios, relevant algorithms are used to protect data in privacy preserving data mining (PPDM). Many approaches have been adopted to solve these issues, developing algorithms that modify the original data in such a way that the data and knowledge remain private even after the mining process [11]. Techniques include data perturbation, blocking feature values, swapping tuples, etc. A PPDM scheme should be able to maximize the degree of data modification while retaining the maximum data utility.

13.2.2 Analysis System and Perturbation

The data analysis model shown in Fig. 13.1 consists of two parts: an authorized user and a data analyst [12]. The authorized user owns the original data and manipulates it. Data is represented in tabular form with rows and columns. The original data contains sensitive information that should not be disclosed, for privacy reasons. The authorized user manipulates the original data into perturbed data, which is called fuzzy data. Fuzzy data hides the sensitive information in the original data. During data publishing the user gives the fuzzy data to the data analyst, who applies data mining techniques to it. In this way the data is protected by the authorized user, who replaces the actual values with fuzzy values. The data mining techniques used by the analyst are classification and clustering.

Fig. 13.1 Data analysis system (authorized user: original data, data manipulation, fuzzy data; analyst: data analysis by data classification and data clustering)

13.3 Fuzzy Based Approach

Fuzzy sets are an extension of classical set theory; their use as a different approach to preserving privacy was introduced in Ref. [13]. The main characteristic of fuzzy sets, in contrast with crisp sets, is the progressive transition from one set to another. This natural characteristic of fuzzy logic provides an automatic mechanism for dealing with imprecision and uncertainty, which are inherent to real world knowledge. A data set can be assessed using fuzzy membership in fuzzy sets [14]. A fuzzy set is a pair (A, μA) where A is a set and μA : A → [0, 1]. For all x ∈ A, μA(x) is called the grade of membership of x. Each linguistic term can be represented as a fuzzy set with its own membership function [15]. An S-shaped fuzzy membership function is given as:

\[
f(x; a, b) =
\begin{cases}
0, & x \le a \\
2\left(\dfrac{x-a}{b-a}\right)^{2}, & a \le x \le \dfrac{a+b}{2} \\
1 - 2\left(\dfrac{x-b}{b-a}\right)^{2}, & \dfrac{a+b}{2} \le x \le b \\
1, & x \ge b
\end{cases}
\]

where x is the value of the sensitive attribute, and a and b are the minimum and maximum values of the sensitive attribute in the original data set.
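The S-shaped membership function can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the function names `s_membership` and `perturb` and the sample ages are hypothetical, and a and b are taken as the minimum and maximum of the attribute, as described above.

```python
def s_membership(x, a, b):
    """S-shaped fuzzy membership grade of x, for bounds a < b."""
    if b <= a:
        raise ValueError("b must be greater than a")
    mid = (a + b) / 2
    if x <= a:
        return 0.0
    if x <= mid:
        # lower half of the S-curve
        return 2 * ((x - a) / (b - a)) ** 2
    if x < b:
        # upper half of the S-curve
        return 1 - 2 * ((x - b) / (b - a)) ** 2
    return 1.0


def perturb(values):
    """Replace each sensitive value by its membership grade in [0, 1]."""
    a, b = min(values), max(values)
    return [s_membership(v, a, b) for v in values]


ages = [20, 30, 40, 50, 60]  # hypothetical sensitive attribute values
fuzzy_ages = perturb(ages)
print(fuzzy_ages)  # [0.0, 0.125, 0.5, 0.875, 1.0]
```

The published values all lie in [0, 1], so the analyst never sees the raw attribute, while the relative ordering of the records is retained.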

13.4 Classification

Data mining utilities are used to assess the original dataset and the dataset after perturbation. The analyst applies data mining techniques such as classification and clustering to the distorted data. In this paper we used the classifiers SVM, ID3 and C4.5 on the original and the perturbed data, and used the accuracy results to find the best classifier among them. The accuracy results indicate that SVM gives more promising accuracy than ID3 and C4.5. The data before and after perturbation is relatively the same, as shown by the mining utility.

Classification is the process of finding a set of models that describe and distinguish data classes and concepts. The purpose of the model is to predict the class of objects whose label is unknown. Classification is a two step process, shown in Fig. 13.2. (1) Build the classification model using training data. Every object in the training data must be pre-classified, i.e. its class label must be known. (2) Test the model generated in the preceding step by assigning class labels to the data objects in a test dataset. The test data may differ from the training data. Every element of the test data is also pre-classified in advance. The accuracy of the classification model is determined by comparing the true class labels in the test set with those assigned by the model.

Fig. 13.2 Classification process (training data: build model, classified model; test data: assign classes, accuracy; new data)

13.5 K-Means Clustering

Clustering is a well-known problem in statistics and engineering, namely, how to arrange a set of vectors (measurements) into a number of groups (clusters). Clustering is an important area of application for a variety of fields including data mining, statistical data analysis and vector quantization [16]. The problem has been formulated in various ways in the machine learning, pattern recognition, optimization and statistics literature. The fundamental clustering problem is that of grouping together (clustering) data items that are similar to each other: given a set of data items, clustering algorithms group similar items together. Clustering has many applications, such as customer behavior analysis, targeted marketing, forensics and bioinformatics.
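As an illustration of how k-means groups similar items, the following Python sketch implements the standard assign-and-update iteration. It is not the Java implementation used in the experiments; the function names and sample points are hypothetical, and the initial centroids are simply the first k points, a simplification that a practical implementation would replace with a better initialization.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def k_means(points, k, max_iterations=100):
    """Group points into k clusters; returns (assignment, centroids)."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical two-dimensional measurements forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.9, 1.1), (4.9, 5.1)]
labels, centers = k_means(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

K-means converges to a local optimum that depends on the starting centroids, so repeated runs with different initializations are commonly compared in practice.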

13.6 Experimental Results

In this paper we have used the real world Fertility, Hepatitis and Iris datasets, downloaded from the UCI Machine Learning Repository; the Hepatitis dataset contains details of hepatitis patients. In these datasets the sensitive attribute is the age of the patients. The sensitive attribute is transformed into distorted data, and this distorted data is published, protecting the privacy of the individual. The original dataset, which holds sensitive information about patients, is perturbed with the S-based fuzzy membership function. In our experiments we used the Tanagra data mining tool for classification; k-means clustering was implemented in Java, and performance was checked using the MATLAB package.

13.6.1 Proposed Method

Step 1: An authorized user owns an original dataset (D).
Step 2: The original dataset, which has sensitive attributes, is perturbed using the S-based fuzzy membership function to give the fuzzy data (D′).
Step 3: The fuzzy data (D′) is published by the user to an analyst for analysis.
Step 4: The analyst receives the fuzzy data and applies mining techniques.

The classifiers used are SVM, ID3 and C4.5. The data before and after perturbation is relatively the same, as shown by the mining utility. "Accuracy of a classifier is simply the ratio ((no. of correctly classified examples)/(total no. of examples)) × 100". Technically it can be defined as

\[
\text{accuracy} = \frac{TP + TN}{(TP + FN) + (FP + TN)}
\]
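The accuracy formula can be checked with a small Python helper. The function name and the confusion counts below are hypothetical and chosen only to illustrate the arithmetic; they are not taken from the experiments.

```python
def classifier_accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / ((TP + FN) + (FP + TN))."""
    total = (tp + fn) + (fp + tn)
    if total == 0:
        raise ValueError("at least one example is required")
    return (tp + tn) / total


# Hypothetical counts: 45 true positives, 40 true negatives,
# 5 false positives and 10 false negatives out of 100 examples.
acc = classifier_accuracy(tp=45, tn=40, fp=5, fn=10)
print(acc)        # 0.85
print(acc * 100)  # accuracy as a percentage
```

Multiplying the ratio by 100 gives the percentage form used in the quoted definition.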

An experiment measuring the accuracy of the classifiers, based on True Positives (TP) and False Positives (FP) as per the above equation, is tabulated in Table 13.1. The table gives the accuracy of the classifiers SVM, ID3 and C4.5 on the original and perturbed datasets. The accuracy of the classifiers and of k-means clustering is shown in Figs. 13.3, 13.4 and 13.5. The results indicate that classification and clustering performed on the original data and on the perturbed data are relatively equivalent. We have also found that, with the fuzzy approach, the processing time of the data is considerably reduced compared to the methods used before.

Table 13.1 Classification of datasets (TP and FP on the original (ORIG) and distorted (DIST) data)

Dataset     Classifier   ORIG TP   ORIG FP   DIST TP   DIST FP
Iris        ID3          0.98      0.02      0.98      0.02
Iris        SVM          0.97      0.03      0.97      0.03
Iris        C4.5         0.99      0.01      0.99      0.01
Hepatitis   ID3          0.92      0.07      0.92      0.07
Hepatitis   SVM          0.92      0.7       0.92      0.07
Hepatitis   C4.5         0.89      0.10      0.89      0.10
Fertility   ID3          0.88      0.12      0.87      0.13
Fertility   SVM          0.88      0.12      0.86      0.14
Fertility   C4.5         0.87      0.07      0.87      0.07


Fig. 13.3 Fertility dataset

Fig. 13.4 Iris dataset

Fig. 13.5 Hepatitis dataset

13.7 Conclusion and Future Scope

The paper presents a fuzzy based approach to transform original data into fuzzy data, i.e. perturbed data. The method proves efficient at maintaining privacy while the data is published. The original values are unknown to the analyst, which preserves the privacy of the sensitive information owned by an authorized user. The results of our experiments show that classification performed on the original and on the perturbed data is relatively the same. The S-based fuzzy membership function used has improved the processing time of the algorithm. In future we would like to extend our work to other fuzzy membership functions, such as the triangular function, and to use other classification and clustering data mining utilities with the algorithm proposed in this paper.

References

1. V. Estvill–Castro, L. Brankovic, D.L. Dowe, Privacy in data mining. Australian Computer Society NSW Branch. Available at www.acs.org.au/nsw/articles/199082.html
2. Thanveer, G. Narasimha, C.V. GuruRao, Data perturbation and feature selection in preserving privacy, in Proceedings of the 2012 Ninth International Conference on Wireless and Optical Communication Networks (WOCN), (Indore). IEEE catalog number: CFP12604-CDR, ISBN: 978-1-4673-1989-8/12
3. Pengpeng Lin, Jun Zhang, Ingrid St. Omer, Huanjing Wang, Jie Wang, A comparative study on data perturbation with feature selection, in Proceedings of the International MultiConference of Engineers and Computer Scientists 2011, vol. 1 (IMECS, Hongkong, 2011), 16–18 Mar 2011
4. S.E. Fienberg, J. McIntyre, Data swapping: variations on a theme by Dalenius and Reiss. J. Off. Stat. 21, 309–323 (2005)
5. K. Muralidhar, R. Sarathy, Data shuffling a new masking approach for numerical data. Manage. Sci. 52, 658–670 (2006)
6. Y. Li, S. Zhu, L. Wang, S. Jajodia, Privacy enhanced micro aggregation method, in Proceedings of 2nd International Symposium on Foundations of Information and Knowledge Systems, 2002, pp. 148–159
7. Shuting Xu, Shuhua Lai, Fast Fourier Transform based data perturbation method for privacy protection, in Proceedings of IEEE Conference on Intelligence and Security Informatics, New Brunswick, New Jersey, May 2007
8. S. Mukharjee, Zhiyuan Chen, A. Gangopadhyay, A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. VLDB J. 15, 293–315 (2006)
9. Pinkas, Cryptographic techniques for privacy-preserving data mining. ACM SIGKDD Explorations 4(2), 12–19 (2002)
10. B. Karthikeyan, G. Manikandan, V. Vaithiyanathan, A fuzzy based approach for privacy preserving clustering. J. Theor. Appl. Inf. Technol. 32(2), 118–122 (2011)
11. R. Agrawal, R. Srikant, Privacy–preserving data mining, in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, San Diego, 2003, pp. 86–97
12. S. Xu, J. Zhang, D. Han, J. Wang, Data distortion for privacy protection in a terrorist analysis system, in Proceedings of the 2005 IEEE International Conference on Intelligence and Security Informatics, 2005, pp. 459–464
13. V. Vallikumari, S. Srinivasa Rao, KVSVN. Raju, KV. Ramana, BVS. Avadhani, Fuzzy based approach for privacy preserving publication of data. Int. J. Comput. Sci. Netw. Secur. 8(1), (2008)
14. L. Zadeh, Fuzzy sets. Inf. Control. 8, 338–353 (1965)
15. J. Timothy, Ross, Fuzzy Logic with Engineering Applications (McGraw Hill, New York/Singapore, 1997)
16. T. Jahan, G. Narsimha, C.V Guru Rao, Privacy preserving clustering on distorted data in International Organization of Scientific Research. J. Comput. Eng. ISSN: 2278–0661, ISBN: 2278–8727 5(2), 25–29 (2012)
