A DATA MINING APPROACH TO AVOID POTENTIAL BIASES


International Journal of Computer Engineering & Technology (IJCET), Volume 6, Issue 7, Jul 2015, pp. 27-34, Article ID: IJCET_06_07_004
Available online at http://www.iaeme.com/IJCET/issues.asp?JTypeIJCET&VType=6&IType=7
ISSN Print: 0976-6367; ISSN Online: 0976-6375
© IAEME Publication

A DATA MINING APPROACH TO AVOID POTENTIAL BIASES

S. Malpani Radhika
PG Student, Department of Computer Engineering, JSPM NTC, Savitribai Phule Pune University, Pune, India

Dr. Sulochana Sonkamble
Professor & Head, Department of Computer Engineering, JSPM NTC, Savitribai Phule Pune University, Pune, India

ABSTRACT
Data mining is the science of identifying patterns (knowledge) in large collections of data. The identified patterns serve a Decision Support System (DSS) in generating classification rules that aid decision making. Potential biases and potential privacy invasion are among the possible negative outcomes of data mining. Rule mining for categorization has paved the way for automated decisions such as accepting or rejecting a loan request. Biased (discriminatory) results may be generated when the training data set is itself biased; researchers have therefore proposed anti-biasing techniques in data mining that cover both the discovery and the prevention of biases. Biases are of two types, direct and indirect. Direct biases occur when decisions are made on a sensitive attribute; indirect biases occur when decisions are made on the basis of non-sensitive attributes that correlate with a sensitive one. A number of researchers have worked on the prevention of direct or indirect biases. The existing method follows a preprocessing approach to bias prevention and uses rule protection and rule generalization. The preprocessing technique has a limitation: to prevent a biased rule, the method removes the rule directly from the database, which decreases the quality of the database. In this paper a system is proposed that prevents data from biased rules by using a post-processing method. It produces a much smaller set of high-quality predictive rules directly from the dataset.

Key words: Extended-CPAR, Direct Biases, Indirect Biases, Post-processing, Preprocessing, Measurement of Biases, Transformation of Biases, Direct Rule Protection (Method 1).

Cite this Article: Radhika, S. M. and Dr. Sonkamble, S. A Data Mining Approach To Avoid Potential Biases. International Journal of Computer Engineering and Technology, 6(7), 2015, pp. 27-34.
http://www.iaeme.com/IJCET/issues.asp?JTypeIJCET&VType=6&IType=7

1. INTRODUCTION

In the common understanding, the word "biased" refers to an action that leads to unfair decisions toward people on the basis of their membership in a group, without regard to individual merit. For instance, U.S. federal laws forbid bias based on race, color, religion, nationality, gender, marital status and age in a number of settings, including the scoring of credit or insurance, sale, rental, etc. [1]. On the research side, the problem of bias in credit, finance, insurance, the labor market, education and other human activities has been studied by many researchers in the human and economic sciences. Data mining technology is both a source of biased decisions and a means for discovering and preventing bias.

Direct bias is apparent when a person is treated adversely because of sensitive individual attributes such as sex, race, age, disability or marital status [1]. This type of bias is simple to recognize and can seriously affect the person being discriminated against. Indirect bias occurs when a rule appears to treat all people equally but in effect disadvantages a certain group of people.

Services in the information society allow and require the collection of large amounts of data. These data are used to train association or categorization rules in view of making automated decisions, such as accepting or rejecting a loan, calculating an insurance premium, personnel selection, etc. Automating decisions may give a sense of fairness: categorization rules do not guide themselves by personal preference. However, at a closer look, one realizes that categorization rules are learned by the system from the training data. If the training data are intrinsically biased for or against a particular community, the learned model may show biased, intolerant behavior; in other words, the real reason behind denying a loan may simply be that the applicant belongs to another nationality. Therefore such potential biases need to be removed from the training data without affecting the decision-making utility.

Everyone wants to prevent their data from becoming a source of bias, since data mining tasks generate biased models from biased data sets as part of automated decision making. In [4] it is concluded that data mining can be both a source of bias and a resource for discovering bias. Hence techniques to avoid bias have to be introduced, and they should be revised to achieve more accuracy and to allow the DSS to make bias-free decisions based on a bias-free dataset [1].

In this paper we discuss preventing biased rules in the dataset. For this purpose we introduce two methods: the first is a post-processing method, which prevents biased rules by generating strong rules from the input dataset; the second is a categorization-with-least-bias method, which also prevents the data from generating biased rules. The implementation details of the proposed system are discussed in the following sections.

The remainder of the paper is organized as follows. Section II reviews the related work done by researchers on preventing databases from biased rules. Section III describes the implementation details of the proposed system, including the system overview and the algorithms used. Section IV presents the results and discussion of the proposed system.


Section V concludes the paper, and the references used in this work are listed at the end.

2. RELATED WORK

2.1. Literature Review
Despite the tremendous growth of information systems that rely on data mining technology for decision making, the problem of anti-biasing in data mining has not received much attention. This section discusses the work done by researchers on detecting and measuring the biases that arise in data mining, as well as the related work on preventing bias in data mining.

In [1], the authors present two novel algorithms, Apriori and AprioriTid, which are fundamentally different from the previously known algorithms. The features of the two best algorithms are combined to form a hybrid algorithm known as AprioriHybrid. Apriori begins by counting the number of item occurrences in every pass; candidate item sets are then generated and the support of the candidates in each sample item set is evaluated. The distinguishing feature of AprioriTid is that it does not rescan the database to calculate support after every pass.

In [2], the authors describe a framework for evaluating potential biases by analyzing previous decision records generated from sensitive attributes, and address the issue of determining an accurate measure of the degree of bias against a given group in a given context with respect to a decision. The problem is re-articulated in a classification-rule-based setting, and a collection of quantitative measures of bias is introduced based on existing norms and regulations. Several measures for quantifying potential bias, i.e., the elift, olift and slift formulas, are introduced in this work (a short sketch of elift is given below, after this group of references).

In [3], the authors introduce the problem of discovering biases through data mining in datasets of historical decision records, selected by the user or by the system. They capture direct and indirect biases by modeling the groups protected by law and the contexts in which bias occurs, in a classification-rule-based syntax.

In [4], the authors introduce and study the idea of biased classification rules and show that providing an assurance of non-bias is a non-trivial task. The authors also introduced the term "biases in datasets" for the first time. In data mining, classification models are constructed from historical data; hence, if some biased decision making was done previously, the classifications learned by the models will also be biased. The research therefore focused on identifying the sensitive attributes that could contribute to biased decision making, an idea that led to the term "direct-bias prevention". In addition, an "inference model" to tackle indirect bias was introduced; it suggests maintaining a secondary database, called "background knowledge", along with the original dataset.

In [5], the authors guide the reader through the legal problems concerning the biases hidden in data, and through distinct legally grounded analyses for unveiling biasing circumstances. The authors present DCUBE, an analytical tool supporting the interactive and iterative procedure of detecting potential biases. The intended users of DCUBE include anti-bias authorities, owners of socially sensitive decision databases, auditors, and researchers in the social sciences, economics and law.
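As a concrete illustration of the simplest of these measures, the sketch below computes the extended lift (elift) of a classification rule A,B → C, defined as conf(A,B → C) / conf(B → C), following the description in [2]. The class and method names and the example numbers are our own assumptions, not code from the reviewed work.

/**
 * A minimal sketch (not the reviewed authors' code) of the extended lift
 * measure: elift(A,B -> C) = conf(A,B -> C) / conf(B -> C).
 */
public final class BiasMeasures {

    /** conf(X -> C) = support(X and C) / support(X). */
    public static double confidence(double supportXandC, double supportX) {
        return supportXandC / supportX;
    }

    /**
     * elift of the rule A,B -> C, where A is the sensitive itemset and B the context.
     * Values well above 1 indicate that adding A raises the probability of the
     * (negative) decision C.
     */
    public static double elift(double confAB_C, double confB_C) {
        return confAB_C / confB_C;
    }

    public static void main(String[] args) {
        // Hypothetical counts: among applicants in context B, 30 of 100 are denied;
        // among applicants with both A and B, 15 of 20 are denied.
        double confB_C  = confidence(30, 100);
        double confAB_C = confidence(15, 20);
        System.out.printf("elift = %.2f%n", elift(confAB_C, confB_C)); // 2.50
    }
}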


In [6], the authors present a model for finding evidence of bias in datasets of historical decision records for socially sensitive tasks, including access to credit, mortgages, insurance, the labor market and other benefits. They present a reference model for the analysis and disclosure of bias in socially sensitive decisions taken by a DSS. The methodology consists first of extracting frequent classification rules, and then of examining them on the basis of quantitative measures of bias and their statistical significance. The key legal concepts of protected-by-law groups, direct bias, indirect bias, genuine occupational requirement, affirmative action and favoritism are formalized as statements over the set of extracted rules and, possibly, additional background knowledge.

In [7], the authors address the issue of biased rules arising in the dataset by introducing a novel classification method for learning non-biased models from biased training data. The method manipulates the dataset by making the least intrusive modification that leads to an unbiased dataset.

In [8], the authors examine and study how to adjust the naive Bayes classifier to perform classification that is constrained to be independent of a given sensitive attribute.

In [9], the authors discuss how to clean training datasets and outsourced datasets in such a way that legitimate classification rules can still be extracted but biased rules based on sensitive attributes cannot. The authors analyze how biased decision making could affect cyber security applications, particularly intrusion detection systems (IDSs). IDSs use computational intelligence technologies such as data mining, and it is evident that the training data of these systems could be capable of generating biases, which would cause them to make biased decisions when predicting intrusions.

In [10], the authors discuss a novel preprocessing technique for indirect bias prevention based on data transformation, which can handle several biased attributes and their combinations. They also propose measures for assessing the technique in terms of its success in bias prevention and its impact on data quality.

In [11], the Adult dataset is provided. This dataset consists of 48,842 records, split into a "train" part with 32,561 records and a "test" part with 16,281 records, and has 14 attributes (excluding the class attribute).

2.2. Existing System
The existing system uses a preprocessing approach for preventing direct and indirect biases in the dataset. It is divided into two phases.

2.2.1. Measurement of biases
Direct and indirect bias detection consists of obtaining the alpha-biasing rules and the redlining rules. First, the potentially biasing rules and the potentially non-biasing rules are identified based on the biased items present in the database DB and the set FP of frequent categorization rules. Then, using the direct bias measures and the bias threshold, direct bias is measured by extracting the alpha-biasing rules from the potentially biasing rules. In the same way, indirect bias is measured by extracting the redlining rules from the potentially non-biasing rules combined with background knowledge, using the indirect bias measures and the bias threshold [1].
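A small illustration of the measurement step follows. It is our own sketch, not code from [1]: each potentially biasing rule carries a bias measure (for example the elift value computed as in the sketch of Section 2), and every rule whose measure reaches the bias threshold alpha is flagged as an alpha-biasing rule. The Rule class, its fields and the example values are assumptions.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of the measurement phase: flag rules whose bias measure reaches alpha. */
public final class BiasMeasurement {

    static final class Rule {
        final String description;
        final double biasMeasure;   // e.g. the elift of the rule

        Rule(String description, double biasMeasure) {
            this.description = description;
            this.biasMeasure = biasMeasure;
        }
    }

    /** Returns the potentially biasing rules whose measure is at least alpha. */
    static List<Rule> alphaBiasingRules(List<Rule> potentiallyBiasing, double alpha) {
        List<Rule> flagged = new ArrayList<Rule>();
        for (Rule r : potentiallyBiasing) {
            if (r.biasMeasure >= alpha) {
                flagged.add(r);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        List<Rule> candidates = Arrays.asList(
                new Rule("{foreign=yes, city=X} -> deny", 2.50),
                new Rule("{job=none, city=X} -> deny", 1.17));
        // With a bias threshold alpha = 1.2, only the first rule is flagged.
        for (Rule r : alphaBiasingRules(candidates, 1.2)) {
            System.out.println("alpha-biasing rule: " + r.description);
        }
    }
}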


2.2.2. Transformation of biases
The original database DB is transformed in such a way that direct or indirect biases are eliminated, with the least impact on the data and on legitimate decision rules, so that no unfair decision rule can be mined from the transformed database [1].

2.2.3. Algorithms used in the preprocessing approach [1]
The class attribute in the database is assumed to be binary, and the rules in FP are considered with the negative classification (decision). The biased item set (A) and the non-biased item set (D) may be binary or non-binary categorical.

i) Direct bias prevention algorithms (a simplified sketch of the idea behind Method I follows this list):
• Direct Rule Protection (Method I)
• Direct Rule Protection (Method II)
• Direct Rule Protection and Rule Generalization
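For concreteness, the following is our reading of the data-transformation idea behind Direct Rule Protection (Method I), not the exact procedure of [1]: for an alpha-biasing rule A,B → C (A the biased itemset, C the negative decision), the class of records supporting ¬A, B, ¬C is changed to C, which raises conf(B → C) and therefore lowers the rule's elift until it drops below the threshold alpha. The record layout, field names and example data are illustrative assumptions, and the record-selection and impact-minimization details of [1] are omitted.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified sketch of the transformation idea behind direct rule protection. */
public final class DirectRuleProtectionSketch {

    static final class Record {
        final boolean sensitive;   // A: record belongs to the protected group
        final boolean context;     // B: record satisfies the rule's context
        boolean negativeDecision;  // C: e.g. "deny"

        Record(boolean sensitive, boolean context, boolean negativeDecision) {
            this.sensitive = sensitive;
            this.context = context;
            this.negativeDecision = negativeDecision;
        }
    }

    /** elift(A,B -> C) = conf(A,B -> C) / conf(B -> C), recomputed from the data. */
    static double elift(List<Record> db) {
        double ab = 0, abc = 0, b = 0, bc = 0;
        for (Record r : db) {
            if (r.context) {
                b++;
                if (r.negativeDecision) bc++;
                if (r.sensitive) {
                    ab++;
                    if (r.negativeDecision) abc++;
                }
            }
        }
        return (abc / ab) / (bc / b);
    }

    /** Flip the class of ¬A,B,¬C records to C until the rule's elift falls below alpha. */
    static void protect(List<Record> db, double alpha) {
        for (Record r : db) {
            if (elift(db) < alpha) {
                return;                      // rule is no longer alpha-biasing
            }
            if (!r.sensitive && r.context && !r.negativeDecision) {
                r.negativeDecision = true;   // change ¬C to C for this ¬A,B record
            }
        }
    }

    public static void main(String[] args) {
        List<Record> db = new ArrayList<Record>(Arrays.asList(
                new Record(true, true, true), new Record(true, true, true),
                new Record(true, true, false),
                new Record(false, true, true), new Record(false, true, false),
                new Record(false, true, false), new Record(false, true, false)));
        System.out.printf("elift before = %.2f%n", elift(db)); // 1.56
        protect(db, 1.25);
        System.out.printf("elift after  = %.2f%n", elift(db)); // 1.17
    }
}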

ii) Indirect bias prevention algorithms:

3. IMPLEMENTATION DETAILS

3.1. System Overview
Figure 1 shows the proposed system, which introduces methods for preventing bias in the database. Two methods are used: a post-processing method, i.e., Extended-CPAR, for removing bias, and a Categorization with Least Bias algorithm. Initially the user uploads a dataset that may contain biased rules, and the biased rules are first prevented by the post-processing method. This algorithm merges the benefits of associative categorization and traditional rule-based categorization, and it consists of three basic steps (a sketch of the accuracy-estimation and categorization steps follows Figure 1):

• Rule generation
• Estimation of rule accuracy
• Categorization and rule analysis

Figure 1 System Architecture
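As a sketch of steps 2 and 3 above, the code below follows the usual CPAR conventions that Extended-CPAR builds on, which is an assumption on our part rather than the authors' implementation: each rule's expected accuracy is the Laplace accuracy (n_c + 1) / (n_total + k), where n_c is the number of covered tuples of the rule's class, n_total the total number of covered tuples and k the number of classes, and a tuple is assigned the class whose best matching rules have the highest average expected accuracy. The class names and the choice of bestK are illustrative.

import java.util.*;
import java.util.stream.Collectors;

/** Sketch of CPAR-style rule accuracy estimation and categorization. */
public final class ExtendedCparSketch {

    /** Laplace expected accuracy of a rule: (n_c + 1) / (n_total + numClasses). */
    static double laplaceAccuracy(int coveredOfClass, int coveredTotal, int numClasses) {
        return (coveredOfClass + 1.0) / (coveredTotal + numClasses);
    }

    static final class Rule {
        final Set<String> body;      // attribute=value conditions
        final String predictedClass;
        final double accuracy;

        Rule(Set<String> body, String predictedClass, double accuracy) {
            this.body = body;
            this.predictedClass = predictedClass;
            this.accuracy = accuracy;
        }

        boolean matches(Set<String> tuple) {
            return tuple.containsAll(body);
        }
    }

    /** Categorization step: average the best k matching rules of each class. */
    static String classify(Set<String> tuple, List<Rule> rules, int bestK) {
        Map<String, List<Double>> perClass = new HashMap<String, List<Double>>();
        for (Rule r : rules) {
            if (r.matches(tuple)) {
                perClass.computeIfAbsent(r.predictedClass, c -> new ArrayList<Double>())
                        .add(r.accuracy);
            }
        }
        String best = null;           // null if no rule matches the tuple
        double bestScore = -1.0;
        for (Map.Entry<String, List<Double>> e : perClass.entrySet()) {
            double score = e.getValue().stream()
                    .sorted(Comparator.reverseOrder())
                    .limit(bestK)
                    .collect(Collectors.averagingDouble(d -> d));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two toy rules; accuracies computed with the Laplace estimate (2 classes).
        Rule r1 = new Rule(new HashSet<>(Arrays.asList("income=high")), "grant",
                laplaceAccuracy(40, 45, 2));
        Rule r2 = new Rule(new HashSet<>(Arrays.asList("income=high", "city=X")), "deny",
                laplaceAccuracy(5, 12, 2));
        Set<String> applicant = new HashSet<>(Arrays.asList("income=high", "city=X", "age=30"));
        System.out.println(classify(applicant, Arrays.asList(r1, r2), 3)); // grant
    }
}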

The second method prevents bias by using Categorization with Least Bias. This method changes the class allocation of individual data objects in the given data to make the data bias free. The basic idea is that the data objects nearest to the decision boundary are the most likely victims of bias.


The main purpose is therefore to alter the distribution of these borderline objects to make them bias free. A ranking function is applied to the original dataset to identify the objects nearest to the biased data. The basic steps of this algorithm are as follows (a sketch of the ranking step follows this list):

• Check the eligibility criteria
• Rank the clients by the number of eligibility criteria they satisfy
• Apply the cancellation to those having the lowest rank
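The following is a minimal sketch based on a literal reading of the three steps above, not the authors' implementation: each applicant's rank is the number of eligibility criteria satisfied, and the lowest-ranked applicants are the ones the full method would single out for cancellation or relabeling. The criteria, thresholds and field names are assumptions made for this example.

import java.util.*;
import java.util.function.Predicate;

/** Sketch of the ranking step of Categorization with Least Bias. */
public final class LeastBiasRanking {

    static final class Applicant {
        final String id;
        final int income;
        final int creditScore;
        final boolean employed;

        Applicant(String id, int income, int creditScore, boolean employed) {
            this.id = id;
            this.income = income;
            this.creditScore = creditScore;
            this.employed = employed;
        }
    }

    /** Rank = number of eligibility criteria the applicant satisfies. */
    static int rank(Applicant a, List<Predicate<Applicant>> criteria) {
        int satisfied = 0;
        for (Predicate<Applicant> c : criteria) {
            if (c.test(a)) {
                satisfied++;
            }
        }
        return satisfied;
    }

    public static void main(String[] args) {
        List<Predicate<Applicant>> criteria = Arrays.asList(
                a -> a.income >= 30000,         // hypothetical criterion 1
                a -> a.creditScore >= 600,      // hypothetical criterion 2
                a -> a.employed);               // hypothetical criterion 3

        List<Applicant> applicants = Arrays.asList(
                new Applicant("A1", 45000, 700, true),
                new Applicant("A2", 32000, 580, true),
                new Applicant("A3", 20000, 550, false));

        // Steps 1-2: check the criteria and rank each applicant.
        Map<String, Integer> ranks = new LinkedHashMap<String, Integer>();
        int lowest = Integer.MAX_VALUE;
        for (Applicant a : applicants) {
            int r = rank(a, criteria);
            ranks.put(a.id, r);
            lowest = Math.min(lowest, r);
        }

        // Step 3: the lowest-ranked applicants are the candidates for cancellation.
        for (Map.Entry<String, Integer> e : ranks.entrySet()) {
            boolean selected = e.getValue() == lowest;
            System.out.println(e.getKey() + " rank=" + e.getValue()
                    + (selected ? "  <- lowest rank" : ""));
        }
    }
}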

3.2. Algorithm
Algorithm 1: Post-processing (Extended-CPAR)
1. Let D be the original data set.
2. Declare a counter Weight.
3. Assign Weight = 1 to each attribute.
4. Declare a set Gain that holds the values of the strong attributes.
5. Initially the Rules set is NULL.
6. Declare the set Result, which holds the result obtained by applying the traditional techniques.
7. Declare the set Negative, which holds the attributes not included in Result.
8. If Weight(Negative) > Weight(Result) then
9.     Evaluate the gain for each attribute in Negative.
10.    For each attribute, if its gain is strong then
11.        Change its class attribute ¬C to C.
12.        Add the new classification rule to the Rules set.
13.        Include such attributes in the Result set.
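Step 9 evaluates a gain for each attribute, but the formula is not spelled out here. Assuming it is the FOIL gain used by CPAR, on which Extended-CPAR is based (an assumption on our part), a sketch of that evaluation is shown below; the link to the Minimum Best Gain parameter of Section 4 is also our illustrative choice.

/** Sketch of a FOIL-gain evaluation for a candidate attribute (literal). */
public final class FoilGain {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /**
     * FOIL gain of adding a literal that narrows coverage from (p positive,
     * n negative) tuples to (pStar, nStar):
     * gain = pStar * ( log2(pStar / (pStar + nStar)) - log2(p / (p + n)) ).
     */
    static double gain(int p, int n, int pStar, int nStar) {
        return pStar * (log2((double) pStar / (pStar + nStar))
                      - log2((double) p / (p + n)));
    }

    public static void main(String[] args) {
        // Before the literal: 50 positive and 50 negative tuples covered.
        // After the literal: 40 positive and 10 negative tuples covered.
        double g = gain(50, 50, 40, 10);
        System.out.printf("FOIL gain = %.2f%n", g);   // about 27.1
        double minBestGain = 0.3;                     // value used in Section 4
        System.out.println("strong attribute: " + (g > minBestGain));
    }
}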

4. RESULTS AND DISCUSSION

4.1. Experimental Setup
The system is built in Java (version jdk 8.1) on the Windows platform, with NetBeans (version 8.0.1) as the development tool. The system does not require any specific hardware; any standard machine is capable of running the application.

4.2. Results
The results and the generated graphs for the proposed system are discussed here. With K Value = 3, Minimum Best Gain = 0.3, Total Weight Factor = 0.6 and Gain Similarity Ratio = 0.4, the discrimination removal is as follows. In Graph 1 the red bar shows the existing system and the blue bar shows the proposed system. The existing system implements the pre-processing algorithms, whereas the proposed system implements the post-processing algorithms. The potential bias discovered by the existing system ranges between 0.1% and 2.5%, whereas the potential bias discovered by the proposed system is 15%.


Graph 1 Degree of Potential Biases Removed

Graph 2 shows the memory required by the respective algorithms to execute.

Graph 2 Memory Requirement

5. CONCLUSION

In this paper we discussed the biased rules generated from a database. There are two types of biased rules: direct and indirect.


A number of techniques have been developed for preventing biased rules. In this paper we proposed a post-processing method that prevents biased rules in the database, and we presented the post-processing algorithm used for this purpose. The existing (pre-processing) system identifies only 2.5% of the dataset as discriminated (biased), whereas the proposed (post-processing) system identifies up to 15% of the dataset as biased while generating fewer but stronger classification rules. The proposed system is thus an excellent solution for avoiding biases in data mining.

REFERENCES
[1] Hajian, S. and Domingo-Ferrer, J. A Methodology for Direct and Indirect Discrimination Prevention in Data Mining. IEEE Transactions on Knowledge and Data Engineering, 25(7), July 2013.
[2] Pedreschi, D., Ruggieri, S. and Turini, F. Measuring Discrimination in Socially-Sensitive Decision Records. Proc. Ninth SIAM Data Mining Conf. (SDM 09), 2009, pp. 581–592.
[3] Ruggieri, S., Pedreschi, D. and Turini, F. Data Mining for Discrimination Discovery. ACM Trans. Knowledge Discovery from Data, 4(2), 2010, article 9.
[4] Pedreschi, D., Ruggieri, S. and Turini, F. Discrimination-Aware Data Mining. Proc. of KDD 2008, ACM, 2008, pp. 560–568.
[5] Ruggieri, S., Pedreschi, D. and Turini, F. DCUBE: Discrimination Discovery in Databases. Proc. ACM Intl Conf. Management of Data (SIGMOD 10), 2010, pp. 1127–1130.
[6] Pedreschi, D., Ruggieri, S. and Turini, F. Integrating Induction and Deduction for Finding Evidence of Discrimination. Proc. 12th ACM Intl Conf. Artificial Intelligence and Law (ICAIL 09), 2009, pp. 157–166.
[7] Kamiran, F. and Calders, T. Classification without Discrimination. Proc. IEEE Second Intl Conf. Computer, Control and Comm. (IC4 09), 2009.
[8] Calders, T. and Verwer, S. Three Naive Bayes Approaches for Discrimination-Free Classification. Data Mining and Knowledge Discovery, 21(2), 2010, pp. 277–292.
[9] Hajian, S., Domingo-Ferrer, J. and Martínez-Ballesté, A. Discrimination Prevention in Data Mining for Intrusion and Crime Detection. Proc. IEEE Symp. Computational Intelligence in Cyber Security (CICS 11), 2011, pp. 47–54.
[10] Hajian, S., Domingo-Ferrer, J. and Martínez-Ballesté, A. Rule Protection for Indirect Discrimination Prevention in Data Mining. Proc. Eighth Intl Conf. Modeling Decisions for Artificial Intelligence (MDAI 11), 2011, pp. 211–222.
[11] Kohavi, R. and Becker, B. UCI Repository of Machine Learning Databases, 1996, http://archive.ics.uci.edu/ml/datasets/Adult.
