A Novel Semantic Search Technique Using Ontology

Mahnaz Khani (1), Mohammad V. Malakooti (2), Mehran Mohsenzadeh (3)

1 Graduate Student, Department of Computer Engineering, Islamic Azad University, UAE Branch, Dubai, UAE, [email protected]
2 Faculty and Head of Department of Computer Engineering, Islamic Azad University, UAE Branch, Dubai, UAE, [email protected]
3 Faculty of Department of Computer Engineering, Islamic Azad University, Science & Research Branch, Tehran, Iran, [email protected]
ABSTRACT

In this method, mutual information is used to select the words that form the document vector of each category. The document similarity matrix is derived from the inner product of the document incidence matrix, and from it the relation of each term to the documents is obtained. The relation degree between a test document and a category is then computed as the inner product of the test document vector with the weighted category vector. Based on our simulations and evaluations, the accuracy of our method is better than that of previous methods; moreover, when the ontology is used, the precision value reaches 1.

KEYWORDS: Ontology, TSR, Text Classification, Web Mining, Semantic Web.

ISBN: 978-0-9853483-3-5 ©2013 SDIWC

1 INTRODUCTION

The main motivations for this work are the problems of search: users do not obtain complete and correct information, and the quality of search results is low. We consider a K x N matrix, where K is the number of words and N is the number of documents; an entry is one if the word occurs in the document and zero otherwise. We then calculate the matrices S and R, from which the weights and scores of the words are computed. Finally, we use the ontology: if, according to the ontology, a word is close to the category, we consider it a true positive; otherwise we consider it a false positive.

2 THE MAIN STAGES OF OUR METHOD

1. Remove stop words such as "a", "the", "an", "that", "so", and so on.
2. Apply stemming (Porter). For example, "asks", "asked", and "asking" are all reduced to "ask".
3. Delete the words that occur fewer than five times in the test set, because words that occur only rarely are not statistically dependable.
4. Delete the words that occur in only a single document.
5. Compute the MI score of the remaining words; in each category, the 300 words whose scores are higher than those of the other words are used for testing the classifiers:

MI(t_i, c) = P(t_i, c) · log2( P(t_i, c) / (P(t_i) · P(c)) )    (1)

P(t_i, c) = N_c(t_i) / N_c,   P(t_i) = N(t_i) / N,   P(c) = N_c / N    (2)

where:
N_c(t_i): the number of occurrences of word t_i in category c.
N_c: the number of all words in category c.
N(t_i): the number of occurrences of word t_i in the whole set.
N: the number of all words in the set.

6. A K x N matrix A is considered, where K is the number of words and N is the number of documents in each layer. This is the incidence matrix of the documents and holds the binary weights {0, 1} of the words in the documents: if word t_i is observed in document d_j, the value is one; otherwise it is zero.
7. Build the document similarity matrix S.
8. Obtain the word-document relation matrix R as the inner product of matrix A with matrix S.
9. Multiply matrix R by the vector I to obtain the vector V_c for category C.
10. Normalize V_c by applying the weighted vector average method.
11. Obtain the weight of word i in category C by:

W_Ci = v_Ci · log2( |c| / cf_i )    (3)

where |c| is the number of categories and cf_i is the number of category vectors that include word t_i.

12. By multiplying a new test document d_new by W_Ci,
the relation degree of the test document to category c is obtained:

Score(c, d_new) = d_new · W_c

13. Using the ontology: if the rank of some words with respect to category C turns positive, then, based on the ontology used, the word that is semantically nearer to category C is classified positive and considered correct, while the remaining words that were classified positive are considered incorrect.

The process used for testing the various learning algorithms is summarized in Figure 1.

[Figure 1. Learning Algorithms Test Process: text files are passed through an index server that produces word counts per file; after feature selection, the data set is fed to the learning methods: Find similar, Decision tree, Naïve Bayes, Bayes net, SVM, and the proposed method.]

3 EXAMPLE OF METHOD

For example, we have three words whose scores fall in the category "corn". In the previous method, because each score is higher than the threshold, all three words are classified as true positives; in our method, only "Wheat" is classified as a true positive, because it is semantically near "corn" in the ontology.

Table 1. Example scores in category "corn"

Term     | Score | λ = 0.87 | Ontology
Interest | 0.92  | TP       | FP
Rate     | 0.89  | TP       | FP
Wheat    | 0.91  | TP       | TP

4 EVALUATION WITH MICRO-AVERAGING

In this method, micro-averaging, i.e. Precision and Recall, is used for the evaluation of the proposed method:

Precision = Σ_{i=1..|c|} TP_i / Σ_{i=1..|c|} (TP_i + FP_i)    (4)

Recall = Σ_{i=1..|c|} TP_i / Σ_{i=1..|c|} (TP_i + FN_i)    (5)

TP_i: number of words classified as true positive in category C_i.
FN_i: number of words classified as false negative in category C_i.
FP_i: number of words classified as false positive in category C_i.
|c|: the number of categories.

4.1 Implementation

To implement our method, we used SQL Server 2008 and Delphi version 7. We used the 10-category data set "Reuters-21578 Apte Split", which includes 9,603 reports in the training group and 3,999 reports in the testing group, to train and test our method.
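The matrix pipeline of stages 6 to 12 can be sketched as follows. The paper does not fully specify how S and the vector I are built, so this sketch assumes S holds the inner products of document columns and I is a vector of ones; the category frequencies cf and the test document are hypothetical toy values.

```python
import math

# Toy example: K = 4 words, N = 3 documents of one category.
# Stage 6: A is the binary incidence matrix, A[i][j] = 1 iff word i occurs in doc j.
A = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 0],
     [1, 1, 1]]
K, N = len(A), len(A[0])

def matmul(X, Y):
    """Plain matrix product of two nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

# Stage 7: document similarity matrix S, assumed here to be the inner products
# of document columns: S[j][k] = number of words shared by documents j and k.
At = [[A[i][j] for i in range(K)] for j in range(N)]   # transpose of A
S = matmul(At, A)                                      # N x N

# Stage 8: word-document relation matrix R = A . S.
R = matmul(A, S)                                       # K x N

# Stage 9: multiply R by the vector I (assumed to be all ones) to get V_c.
V_c = [sum(row) for row in R]

# Stage 10: normalize V_c (weighted vector average).
total = sum(V_c)
v_c = [v / total for v in V_c]

# Stage 11: W_ci = v_ci * log2(|c| / cf_i), where cf_i is the number of
# category vectors containing word i (hypothetical values below).
num_categories = 10
cf = [3, 5, 1, 10]
W_c = [v * math.log2(num_categories / f) for v, f in zip(v_c, cf)]

# Stage 12: score of a new test document = inner product with W_c.
d_new = [1, 0, 1, 1]
score = sum(d * w for d, w in zip(d_new, W_c))
```

Note that a word appearing in every category (cf_i = |c|) receives weight zero, analogously to inverse document frequency.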
Table 2. The 10-category data set "Reuters-21578"

Category Names | Number of Testing Samples | Number of Training Samples
Earn     | 1087 | 2877
Acq      | 719  | 1650
Money-fx | 179  | 538
Grain    | 149  | 423
Crude    | 189  | 389
Trade    | 118  | 369
Interest | 131  | 347
Wheat    | 71   | 212
Ship     | 89   | 197
corn     | 56   | 182
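The MI criterion of Eqs. (1) and (2), used in stage 5 to select the highest-scoring words of each category, can be sketched as follows. The function and variable names, and the toy corpus at the end, are illustrative, not from the paper.

```python
import math

def mutual_information(term, category_tokens, all_tokens):
    """MI(t, c) = P(t, c) * log2( P(t, c) / (P(t) * P(c)) ), Eq. (1),
    with the probabilities of Eq. (2) estimated from token counts:
      P(t, c) = N_c(t) / N_c,  P(t) = N(t) / N,  P(c) = N_c / N."""
    N_c_t = category_tokens.count(term)   # occurrences of t in category c
    N_c = len(category_tokens)            # all tokens in category c
    N_t = all_tokens.count(term)          # occurrences of t in the whole set
    N = len(all_tokens)                   # all tokens in the set
    if N_c_t == 0:
        return 0.0
    p_tc, p_t, p_c = N_c_t / N_c, N_t / N, N_c / N
    return p_tc * math.log2(p_tc / (p_t * p_c))

def top_terms(category_tokens, all_tokens, k=300):
    """Stage 5: keep the k terms with the highest MI score in the category."""
    scores = {t: mutual_information(t, category_tokens, all_tokens)
              for t in set(category_tokens)}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy corpus: "corn" is concentrated in the category,
# while "the" is spread over the whole set.
category_tokens = ["corn"] * 5 + ["the"] * 5
all_tokens = category_tokens + ["the"] * 10 + ["ship"] * 5
```

A category-specific word such as "corn" receives a higher MI score than a word that is frequent everywhere, which is why the 300 top-ranked words characterize the category well.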
4.2 Comparison of Methods

The comparison results are presented in Table 3. The following methods were evaluated on the same data set that we used, so the average of the proposed method can be compared with the others.

Table 3. Breakeven efficiency for the Reuters-21578 Apte split, 10 categories

Category | Find similar (Rocchio, 1971) | Naïve Bayes (Lewis, D.D., 1994) | Bayes Nets (Sahami, 1996) | Decision trees (Chinkering, D., 1997) | Linear SVM (Vapnik, 1995) | Multilabel Method (Chang, Chen, 2006) | Proposed method with ontology (λ = 0.87)
Earn     | 92.9% | 95.9% | 95.8% | 97.8% | 98.0% | 97.5% | 60%
Acq      | 64.7% | 87.8% | 88.3% | 89.7% | 93.6% | 95.1% | 100%
Money-fx | 46.7% | 56.6% | 58.8% | 66.2% | 74.5% | 79.2% | 100%
Grain    | 67.5% | 78.8% | 81.4% | 85.0% | 94.6% | 84.7% | 100%
Crude    | 70.1% | 79.5% | 79.6% | 85.0% | 88.9% | 84.4% | 100%
Trade    | 65.1% | 63.9% | 69.0% | 72.5% | 75.9% | 85%   | 100%
Interest | 63.4% | 64.9% | 71.3% | 67.1% | 77.7% | 81%   | 100%
Ship     | 49.2% | 85.4% | 84.4% | 74.2% | 85.6% | 85.4% | 93.64%
Wheat    | 68.9% | 69.7% | 82.7% | 92.5% | 91.8% | 79.8% | 93.64%
corn     | 48.2% | 65.3% | 76.4% | 91.8% | 90.3% | 78.2% | 96.36%
Average  | 64.6% | 81.5% | 85%   | 88.4% | 92.0% | 91.3% | 94.5%

4.3 Setting the Threshold Limit

There are various policies and strategies for determining the threshold λ, depending on the restrictions of the application. The most important distinction is whether the threshold limit is obtained analytically or experimentally:
• Determining the threshold analytically is called CSV or S-cut.
• Determining the threshold experimentally is called P-cut.
• Adjusting the threshold so that the Precision and Recall values arrive at an appropriate balance between the two is called R-cut.
The threshold limit used in the proposed method is based on R-cut [Yang, 1999].

Table 4. Average for 10 categories of Reuters with various thresholds

[Table 4 reports, for each of the 10 categories, the average value at the thresholds 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.87, and 0.9; the per-category values rise toward 1 as the threshold decreases.]

The following table shows, for the 10 categories of Reuters at various thresholds, the number of words classified as true positive in category C_i (TP_i), the number of words classified as false negative in category C_i (FN_i), and the resulting Recall, Precision, and Average.

Table 5. Recall, Precision, and Average for the 10 categories of Reuters at various thresholds

[The per-category TP_i and FN_i counts are listed for each threshold; the summary rows are:]

Threshold | 0.1   | 0.2  | 0.3   | 0.4   | 0.5   | 0.6   | 0.7  | 0.8  | 0.87 | 0.9
Recall    | 1     | 1    | 1     | 0.77  | 0.65  | 0.53  | 0.36 | 0.26 | 0.20 | 0.16
Precision | 0.99  | 0.94 | 0.89  | 1     | 1     | 1     | 1    | 1    | 1    | 1
Average   | 99.5% | 97%  | 94.5% | 88.5% | 82.5% | 76.5% | 68%  | 63%  | 60%  | 58%
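The micro-averaged measures of Eqs. (4) and (5) and the R-cut style threshold selection above can be sketched as follows. The function names and the per-threshold counts in the usage example are illustrative, not the paper's values.

```python
def micro_precision_recall(tp, fp, fn):
    """Micro-averaged Precision and Recall over categories, Eqs. (4)-(5):
    Precision = sum TP_i / sum (TP_i + FP_i),
    Recall    = sum TP_i / sum (TP_i + FN_i)."""
    precision = sum(tp) / (sum(tp) + sum(fp)) if (sum(tp) + sum(fp)) else 0.0
    recall = sum(tp) / (sum(tp) + sum(fn)) if (sum(tp) + sum(fn)) else 0.0
    return precision, recall

def r_cut(candidates, counts_at):
    """R-cut style selection: choose the threshold whose Precision/Recall
    average is highest. `counts_at` maps a threshold to (tp, fp, fn) lists."""
    return max(candidates,
               key=lambda lam: sum(micro_precision_recall(*counts_at(lam))) / 2)

# Illustrative counts for two categories at two candidate thresholds:
# a low threshold admits many true positives (high recall), a high one
# admits few words but classifies them precisely (precision 1).
counts = {0.1: ([10, 9], [1, 0], [0, 1]),
          0.9: ([2, 1], [0, 0], [8, 9])}
best = r_cut([0.1, 0.9], lambda lam: counts[lam])
```

This reproduces the trade-off of Table 5: as the threshold rises, Precision reaches 1 while Recall falls, and R-cut picks the threshold that balances the two values.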
5 CONCLUSION

In this method, micro-averaging, i.e. Recall and Precision, is used for the evaluation of our proposed method. If the rank of some words with respect to category C turns positive, then, based on the ontology used, the word that is semantically nearer to category C is classified positive and considered correct, and the remaining words that were classified positive are considered incorrect. At this stage the significance of the ontology becomes clear: it causes words to be classified into the correct categories with greater precision. From the laboratory results shown in Table 3, we can observe that the presented method has higher precision and accuracy than the Decision Tree, Bayes Nets, Naïve Bayes, Find Similar, and SVM methods.
Advantages of Our Method

From the laboratory results shown in Table 3, we can observe that the presented method has higher precision and accuracy than the Decision Tree, Bayes Nets, Naïve Bayes, Find Similar, and SVM methods (an indirect comparison). The method operates fairly with the ontology because the number of incorrectly classified positive examples drops to zero, and consequently the precision of the category becomes one.

Disadvantages of Our Method

It is necessary to mention that although our proposed method has a higher average than SVM, it is slower. Solving this problem will be possible through the ontology functionality.

6 FUTURE RESEARCH

Future work includes improving the speed of our method and updating the ontology for better results. The ontology can also be used in Term Space Reduction (TSR), which should certainly yield further positive results.

7 REFERENCES

1. Yang, Y., Pedersen, J.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of the 14th International Conference on Machine Learning, Nashville, USA, pp. 412-420 (1997).
2. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1): 1-47, March (2002).
3. Chang, Y.-C., Chen, S.-M., Liau, C.-J.: A New Inductive Learning Method for Multilabel Text Categorization. IEA/AIE 2006, LNAI 4031, pp. 1249-1258 (2006).
4. Sever, H., Gorur, A., Tolun, M.R.: Text Categorization with ILA. In: Yazici, A., Sener, C. (eds.): ISCIS, pp. 300-307 (2003).