Diabetes Pima Analysis Done by Weka Tool

July 21, 2017 | Author: Aamer Khan | Category: Data Mining



1. Analyse the attributes in the data, and consider their relative
importance with respect to the target class.
The dataset is diabetes.arff, which is provided with Weka; its title is
"Pima Indians Diabetes Database".
What is an Attribute?

Each individual, independent instance that provides the input to machine
learning is characterized by its values on a fixed, predefined set of
features or attributes.


Each instance has a set of attributes and a class. Attributes can be either
discrete (nominal) or continuous (numeric). The relation shown in Weka is
pima_diabetes. It has 768 instances and 9 attributes: 8 predictor
attributes plus the last one, which is the class.

The 8 predictor attributes are continuous (numeric), while the class is
discrete (nominal).

The class has 2 values, and their labels give some indication of what this
dataset is about. According to the figure above, the labels are
tested_negative and tested_positive. The bar graph shows that 500 patients
do not have diabetes (the blue bar) and 268 patients have diabetes. The type
is nominal: the value is one of a fixed set of labels, here effectively yes
or no. Predicting a nominal class is known as classification.
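
As a point of reference for the accuracy figures later in the report, the class counts above fix what a trivial classifier would score by always predicting the more frequent label. The sketch below only does that arithmetic; the function name majority_baseline is made up for illustration and is not part of Weka.

```python
# Class counts as reported in the text for diabetes.arff.
counts = {"tested_negative": 500, "tested_positive": 268}


def majority_baseline(class_counts):
    """Return the most frequent label and the accuracy of always predicting it."""
    total = sum(class_counts.values())
    label, n = max(class_counts.items(), key=lambda kv: kv[1])
    return label, n / total


label, accuracy = majority_baseline(counts)
print(f"Always predicting {label} is right {accuracy:.1%} of the time")
```

A classifier is only informative on this dataset to the extent that it beats this baseline on unseen data.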

This attribute is the number of times pregnant. Its minimum value is 0 and
its maximum value is 17. The type is numeric. (When the value to be
predicted is numeric rather than yes or no, the task is known as
regression.) According to the graph, patients without diabetes outnumber
patients with diabetes overall. In the ranges with fewer readings, patients
with diabetes make up the larger share, and vice versa. The graph shows no
clear positive correlation between the number of pregnancies and diabetes.


According to the graph above, plasma glucose concentration in a 2-hour oral
glucose tolerance test ranges from a minimum of 0 to a maximum of 199, a
level that indicates a true diabetes patient. The normal value of plasma
glucose concentration is taken here to be 136 or below. About 2% of the
patients are unusual cases with a very low level of plasma glucose. The
mean value is approximately 120. When the plasma glucose concentration is
greater than or equal to the mean, there are fewer readings and a larger
share of the patients have diabetes, and vice versa.


According to the graph above, the mean value of diastolic blood pressure is
about 69 and the maximum value is approximately 122. Patients with a
diastolic blood pressure of approximately 40 or below are exceptional
cases; they make up about 1% and are more often affected by diabetes. On
the other hand, patients with a diastolic blood pressure of 60 or above
have a higher chance of being affected by diabetes, and the number of
affected patients increases as diastolic blood pressure increases.


According to the graph above, triceps skin fold thickness ranges from 0 to
99 with a mean of 20. A thickness between 0 and 31 is normal, meaning fewer
patients in that range are affected by diabetes. For a thickness of about
31 to 44, the number of patients affected by diabetes is approximately
equal to the number unaffected. For a thickness of about 40-49, the
proportion of patients with diabetes is higher. From 50 to 56 there are
very few diabetes patients, and the range 56-99 contains only exceptional
cases. Hence the graph shows that diabetes and triceps skin fold thickness
are not correlated.

2-Hour serum insulin (mu U/ml)
According to the graph above, serum insulin ranges from 0 to 846 and the
mean value is 79. In the range 0-134, fewer patients are affected by
diabetes than unaffected. Between serum insulin levels of 134 and 222, the
numbers of affected and unaffected patients are approximately the same.
From a serum insulin level of 222 onwards, the number of patients affected
by diabetes increases. The range 489-624 is a unique case that involves
about 12%. Hence serum insulin and diabetes are strongly correlated.

Body mass index (weight in kg/(height in m)^2)

According to the graph shown above, the minimum value is 0 and the maximum
value is 67.1, while the mean is 31. As the body mass index approaches 30,
the number of patients affected by diabetes increases, and it keeps
increasing as the body mass index rises further. The explanation offered is
that every individual has a healthy body mass range and that exceeding it
makes diabetes more likely. About 10% of the patients are exceptional cases
whose diabetes is not related to their body mass index; their diabetes is
expected to be due to some other disease or abnormality.

Diabetes pedigree function
According to the graph shown above, the minimum value is 0.08, the maximum
value is 2.42, and the mean is 0.472. The graph shows that people with
little or no family history of diabetes are less affected by diabetes,
while people with a family history of diabetes have a higher chance of
getting it. However, the graph suggests about a 55% chance of diabetes due
to hereditary factors, while about 45% of people are unique cases who do
not get diabetes even with a positive family history. Hence we can conclude
from the graph that hereditary factors have roughly a 50% effect on whether
patients suffer from diabetes.



Age (years)
According to the graph shown above, age has a strong relation with
diabetes. The minimum age in this graph is 21, the maximum is 81, and the
mean is 33. From age 21 to about 27, patients unaffected by diabetes
outnumber those affected, but as age approaches 30 or more the proportion
of patients affected by diabetes increases. This is attributed mainly to
the weaker immune system of elderly people. About 1% of the cases in the
graph are exceptional. Hence age and diabetes are directly correlated.















2. Construct graphs of classification performance against training set size
for a range of classifiers taken from those considered in the module. You
may need to experiment with different training sets, depending on what you
have discovered about the data in step (1).

(For the analysis I filtered the dataset down to 5 attributes to study
diabetes, but I have shown the results with all 9 attributes too.)
With 9 attributes



Figure 1 SVM Percentage Split 10%

Figure 2 SVM Percentage Split 20%

Figure 3 SVM Percentage Split 30%

Figure 4 SVM Percentage Split 40%

Figure 5 SVM Percentage Split 50%

Figure 6 SVM Percentage Split 60%

Figure 7 SVM Percentage Split 70%

Figure 8 SVM Percentage Split 80%

Figure 9 SVM Percentage Split 90%


Figure 10 J48 Percentage Split 10%

Figure 11 J48 Percentage Split 20%

Figure 12 J48 Percentage Split 30%

Figure 13 J48 Percentage Split 40%

Figure 14 J48 Percentage Split 50%

Figure 15 J48 Percentage Split 60%

Figure 16 J48 Percentage Split 70%

Figure 17 J48 Percentage Split 80%

Figure 18 J48 Percentage Split 90%


Figure 19 MLP Percentage Split 10%

Figure 20 MLP Percentage Split 20%

Figure 21 MLP Percentage Split 30%

Figure 22 MLP Percentage Split 40%

Figure 23 MLP Percentage Split 50%

Figure 24 MLP Percentage Split 60%

Figure 25 MLP Percentage Split 70%

Figure 26 MLP Percentage Split 80%

Figure 27 MLP Percentage Split 90%

Figure 28 Naïve Bayes Percentage Split 10%

Figure 29 Naïve Bayes Percentage Split 20%

Figure 30 Naïve Bayes Percentage Split 30%

Figure 31 Naïve Bayes Percentage Split 40%

Figure 32 Naïve Bayes Percentage Split 50%

Figure 33 Naïve Bayes Percentage Split 60%

Figure 34 Naïve Bayes Percentage Split 70%

Figure 35 Naïve Bayes Percentage Split 80%

Figure 36 Naïve Bayes Percentage Split 90%


After filtering, with 5 attributes, i.e. number of pregnancies, body mass
index, diabetes pedigree function, age, and the class


Figure 1 SVM Percentage Split 10%

Figure 2 SVM Percentage Split 20%


Figure 3 SVM Percentage Split 30%

Figure 4 SVM Percentage Split 40%

Figure 5 SVM Percentage Split 50%


Figure 6 SVM Percentage Split 60%

Figure 7 SVM Percentage Split 70%

Figure 8 SVM Percentage Split 80%

Figure 9 SVM Percentage Split 90%

Figure 10 J48 Percentage Split 10%

Figure 11 J48 Percentage Split 20%

Figure 12 J48 Percentage Split 30%

Figure 13 J48 Percentage Split 40%


Figure 14 J48 Percentage Split 50%

Figure 15 J48 Percentage Split 60%

Figure 16 J48 Percentage Split 70%

Figure 17 J48 Percentage Split 80%


Figure 18 J48 Percentage Split 90%


Figure 19 Naïve Bayes Percentage Split 10%

Figure 20 Naïve Bayes Percentage Split 20%

Figure 21 Naïve Bayes Percentage Split 30%

Figure 22 Naïve Bayes Percentage Split 40%

Figure 23 Naïve Bayes Percentage Split 50%

Figure 24 Naïve Bayes Percentage Split 60%

Figure 25 Naïve Bayes Percentage Split 70%

Figure 26 Naïve Bayes Percentage Split 80%

Figure 27 Naïve Bayes Percentage Split 90%


Figure 28 MLP Percentage Split 10%

Figure 29 MLP Percentage Split 20%

Figure 30 MLP Percentage Split 30%

Figure 31 MLP Percentage Split 40%

Figure 32 MLP Percentage Split 50%

Figure 33 MLP Percentage Split 60%

Figure 34 MLP Percentage Split 70%

Figure 35 MLP Percentage Split 80%

Figure 36 MLP Percentage Split 90%

Table 1 Different performance metrics running in WEKA (With 9 attributes)

Table 2 Different performance metrics running in WEKA (With 5 attributes)













Table 3 Error measurement for different classifiers in WEKA (with 9
attributes)

Table 4 Error measurement for different classifiers in WEKA (with 5
attributes)

Table 5 Performance measuring in training and test data set using WEKA
(with 9 attributes)

Table 6 Performance measuring in training and test data set using WEKA
(with 5 attributes)






ALL GRAPHS ARE FOR 5 ATTRIBUTES



Graph 1 Percentage Split 10-90 vs Mean Absolute Error

Graph 2 Percentage Split 10-90 vs Root Mean Square Error


Graph 3 Percentage Split 10-90 vs Relative Absolute Error


Graph 4 Percentage Split 10-90 vs Root Relative Squared Error


Graph 5 Percentage Split 10-90 vs Accuracy

Graph 6 Percentage Split 10-90 vs Error Rate


Graph 7 Percentage Split 10-90 vs Time (s)

Graph 8 Percentage Split 10-90 vs Kappa Statistics


3. Analyse the data structure/representation generated by at least three
classifiers when trained on the complete dataset. What does your
analysis tell you about the data set?


The diagrams, tables, and graphs were produced using different classifiers.
The classifiers used for the interpretation are J48, MLP, Naïve Bayes and
SMO. Weka offers the following test options:
Use training set:
Choose this option if the same data set is used for both training and
testing.
Supplied test set:
Choose this option if the actual data set is used as the training set and a
separate testing set is available.
Cross-Validation:
Cross-validation makes it possible to train and test on a single data set.
It splits the data set into m folds, uses m-1 folds for training and the
remaining fold for testing, and repeats this so that each fold serves once
as the testing set.
Percentage split:
Splits the actual data set so that n percent of it is used for training and
the rest for testing.
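
The percentage split and cross-validation options described above can be sketched on instance indices alone. This is a minimal illustration, not Weka's code: the function names, the seed, and the rounding of the training size to the nearest integer are assumptions made for the sketch.

```python
import random


def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle the instance indices and cut them into a training and a testing part."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Assumption: the training size is rounded to the nearest whole instance.
    train_size = int(n_instances * train_percent / 100 + 0.5)
    return indices[:train_size], indices[train_size:]


def k_fold_splits(n_instances, m):
    """Yield (train, test) index lists; each of the m folds is the test set once."""
    indices = list(range(n_instances))
    folds = [indices[i::m] for i in range(m)]
    for i in range(m):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Training and testing sizes for the nine splits used in this report.
for percent in range(10, 100, 10):
    train, test = percentage_split(768, percent)
    print(percent, len(train), len(test))
```

Under this rounding assumption a 10% split of the 768 instances leaves 691 for testing, which equals the sum of the four confusion matrix counts quoted for SMO in the discussion (366 + 89 + 126 + 110).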

Percentage splits of 10, 20, 30, 40, 50, 60, 70, 80 and 90 are used. Table
2 is provided for easier analysis and evaluation. Different performance
metrics such as TP rate, FP rate, Precision, Recall, F-measure and ROC area
are presented as numeric values for the training and testing phases. Table
4 lists different types of error measurement, such as mean absolute error
and root mean squared error, together with the time taken to build the
model in seconds and the kappa statistic. Finally, graphs are provided to
make the results easier to understand.
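
The four error measures plotted in Graphs 1 to 4 can be sketched with their generic textbook definitions. This is an illustrative sketch under assumptions: Weka computes these measures from class probability estimates and its own baseline predictor, so its exact numbers may differ, and the sample values below are made up for the example and are not taken from the tables.

```python
import math


def error_measures(predicted, actual):
    """Textbook MAE, RMSE, relative absolute error and root relative squared error."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    sq_err = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Baseline: always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100 * abs_err / base_abs,
        "rrse_percent": 100 * math.sqrt(sq_err / base_sq),
    }


# Hypothetical example: 1 = tested_positive, 0 = tested_negative,
# predictions are estimated probabilities of tested_positive.
actual = [1, 0, 0, 1]
predicted = [0.8, 0.3, 0.1, 0.4]
print(error_measures(predicted, actual))
```

The two relative measures express the error as a percentage of what the simple baseline would produce, so values below 100% mean the classifier does better than the baseline.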

Now let's start with the SMO classifier. According to Figure 1 (with 9
attributes), approximately 69% of the instances are correctly classified
and approximately 31% are incorrectly classified. The confusion matrix
states that 366 As are correctly classified as As whereas 89 As are
incorrectly classified as Bs, and 126 Bs are incorrectly classified as As
whereas 110 Bs are correctly classified as Bs (A is tested_negative and B
is tested_positive). The kappa statistic shown is 0.2811 and the ROC area
is 0.635. The kappa statistic measures how far the agreement between
predicted and actual classes exceeds the agreement expected by chance. A
kappa of 1 indicates perfect agreement, whereas a kappa of 0 indicates
agreement equivalent to chance. A value of 0.60-0.70 is an acceptable
figure.
The rest of the figures (the remaining figures for 9 attributes and those
for 5 attributes) can be interpreted in the same way.
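
The accuracy, kappa and per-class rates can be recomputed from the four confusion matrix counts. The sketch below assumes the Weka layout, in which each row is an actual class and each column a predicted class, so that 89 counts As predicted as B and 126 counts Bs predicted as A. The function names are made up for illustration, and the ROC line assumes the classifier outputs hard labels, so the curve has a single operating point.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of instances of actual class i predicted as class j."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    observed = sum(matrix[i][i] for i in range(k)) / total
    # Agreement expected by chance, from the row and column totals.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / total ** 2
    return observed, (observed - expected) / (1 - expected)


def class_rates(matrix, positive):
    """TP rate (recall), FP rate, precision and F-measure for one class of a 2x2 matrix."""
    negative = 1 - positive
    tp, fn = matrix[positive][positive], matrix[positive][negative]
    fp, tn = matrix[negative][positive], matrix[negative][negative]
    tp_rate = tp / (tp + fn)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return tp_rate, fp_rate, precision, f_measure


smo_10_percent = [[366, 89],    # actual A (tested_negative)
                  [126, 110]]   # actual B (tested_positive)

accuracy, kappa = accuracy_and_kappa(smo_10_percent)
tp_rate, fp_rate, precision, f_measure = class_rates(smo_10_percent, 0)
single_point_roc = (1 + tp_rate - fp_rate) / 2  # area under a one-point ROC curve

print(round(accuracy, 4), round(kappa, 4), round(single_point_roc, 3))
```

Under this reading of the matrix the sketch reproduces the reported values: about 69% accuracy, a kappa of 0.2811 and an ROC area of 0.635.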

Performance:
Performance should be analysed in two ways. First, the ability of each
classifier to generalise is compared in a table, which shows which
classifier is better than the others. Second, the pattern of errors is
studied. The total time required to build the model is also an essential
parameter when comparing classification algorithms.

According to Table 4, SMO is the best because it has the lowest error rate,
and MLP is second best. Naïve Bayes is third and J48 is fourth, which makes
it the worst algorithm here.
According to Table 6, the Naïve Bayes classifier requires the shortest
model building time, around 0.011 seconds, with J48 second at 0.014
seconds. The MLP algorithm requires the longest model building time, around
0.37 seconds.

4. Combine the results from the previous three steps and all your
classifiers to develop a model of why instances fall into particular
classes. (Your answer to this question should be understandable by someone
who is not a specialist in data mining.)

According to the graphs and my analysis, some attributes are causes of
diabetes, some are effects of diabetes, and a few are neither. Let's start
with pregnancy, which is one of the causes of diabetes. The chance of
gestational diabetes is higher if a woman had symptoms of diabetes during a
previous pregnancy. It is caused by a change in the way a woman's body
responds to the hormone insulin during her pregnancy. As the number of
pregnancies increases, the chance of diabetes goes up with it, establishing
a direct correlation between pregnancy and diabetes. As age increases, an
increased chance of diabetes is observed. Diabetes is mostly observed in
elderly people. One of the reasons for diabetes in elderly people is a weak
immune system, due to lack of exercise, poor diet, co-existing health
issues and cognitive complications. The diabetes pedigree function is also
an attribute that contributes to diabetes progression. People with diabetes
in their family history have a significantly increased chance of having
diabetes at some point in their life. Body mass index has a specific
healthy value for an individual of any age and is one of the main factors
contributing to diabetes. Obesity gives rise to many problems. It causes
abnormal glucose tolerance in the body, which leads to diabetes. Most
people get diabetes because their weight is above their healthy weight
range.
Some attributes are effects of diabetes. Let's talk about blood pressure:
diabetes is one of the main causes of high blood pressure. Diabetes plays a
role in damaging arteries and makes them a target for hardening. Hardening
of the arteries raises the pressure in them and hence causes high blood
pressure. A patient with diabetes is very unlikely to have low blood
pressure. On the other hand, being overweight is also a factor that raises
blood pressure. Body mass index is also related to skin fold thickness: as
body mass index increases, skin fold thickness increases. Serum insulin and
plasma glucose concentration are the tests that are always taken in cases
of diabetes. If the plasma glucose concentration of a patient is more than
approximately 136, the patient is likely to have diabetes, and if it is 199
or more, the patient is confirmed to be a diabetes patient. Serum insulin
is also a test used to check for diabetes in a patient. So these two tests
are used to determine whether diabetes is present and how severe it is.
Once we know the stage of diabetes with the help of these tests, we can
more easily find a way to treat the patients and overcome the problem of
diabetes.