A Novel Semantic Search Technique Using Ontology

Mahnaz Khani (1), Mohammad V. Malakooti (2), Mehran Mohsenzadeh (3)

(1) Graduate Student, Department of Computer Engineering, Islamic Azad University, UAE Branch, Dubai, UAE, [email protected]
(2) Faculty and Head of the Department of Computer Engineering, Islamic Azad University, UAE Branch, Dubai, UAE, [email protected]
(3) Faculty, Department of Computer Engineering, Islamic Azad University, Science & Research Branch, Tehran, Iran, [email protected]

ABSTRACT

In this method, mutual information is used for word selection, and a word-document vector is created for each category according to the selected words. A document similarity matrix is then derived, and the relation between terms and documents is obtained from the inner product of the term-document matrix with the document similarity matrix. The degree of relevance between a test document and each category is obtained using the vector inner product and a weighted average. Based on our simulations and evaluations, the accuracy of our method is better than that of previous methods; moreover, when the ontology is used, the precision value is 1.

KEYWORDS: Ontology, TSR, Text Classification, Web Mining, Semantic Web.

ISBN: 978-0-9853483-3-5 ©2013 SDIWC

1 INTRODUCTION

The main motivations for this work are the problems of search: users do not get complete and correct information, and the quality of search results is low. We consider a K x N matrix, where K is the number of words and N is the number of documents; an entry is one if the word occurs in the document and zero otherwise. From this matrix we compute the matrices S and R, and the weights and scores of the words are calculated. Finally, we apply the ontology: if, according to the ontology, a word is close to the category, we consider the word a true positive; otherwise, we consider it a false positive.

2 THE MAIN STAGES OF OUR METHOD

1. Remove stop words such as "a", "the", "an", "that", "so", etc.
2. Apply stemming (Porter). For example, "asks", "asked", and "asking" are all reduced to "ask".
3. Delete words that occur fewer than five times in the test set, because words that occur only rarely are not statistically dependable.
4. Delete words that occur in only one document.
5. Compute the mutual information (MI) of the remaining words, and in each category use the 300 words with the highest MI for testing the classifiers:

MI(t_i, C) = P(t_i, C) log2 [ P(t_i, C) / ( P(t_i) P(C) ) ]    (1)

P(t_i, C) = N_C(t_i) / N_C,   P(t_i) = N(t_i) / N,   P(C) = N_C / N    (2)

where
N_C(t_i): the number of occurrences of word t_i in category C;
N_C: the number of all words in category C;
N(t_i): the number of occurrences of word t_i in the whole set;
N: the number of all words in the set.

6. A K x N matrix is considered, where K is the number of words and N is the number of documents in each layer. This is the exponent (term-document) matrix A and holds binary weights {0, 1}: if word t_i occurs in document d_j, the entry is one; otherwise it is zero.
7. Build the document similarity matrix S.
8. Obtain the relation between words and documents in the matrix R by the inner product of the matrix A with the matrix S.
9. Multiply the matrix R by the vector I to obtain the vector Vc for category C.
10. Normalize Vc by applying the vector average weight method.
11. The weight of word i in category C is obtained by:


W_Ci = v_ci · log2( |c| / cf_i )    (3)

where
|c|: the number of categories;
cf_i: the number of category vectors that include word t_i.

12. By multiplying a new test document d_new by W_Ci, the relation degree of the test document to category C is obtained:

Score(c, d_new) = d_new · W_C

13. Using the ontology: if the ranks of several words turn positive for category C, the word that is semantically nearest to category C according to the ontology is classified as positive and considered correct, while the remaining positively classified words are considered incorrect.

The process used for testing the various learning algorithms is summarized in Figure 1.

[Figure 1. Learning Algorithms Test Process: text files -> index server -> word counts per file -> feature selection -> data set -> learning methods (Find Similar, Decision Tree, Naïve Bayes, Bayes Net, SVM, and the proposed method).]

3 EXAMPLE OF METHOD

For example, suppose three words score above the threshold in the category "corn". In the previous method, because each score exceeds the threshold, all three words are classified as true positives; in our method only "Wheat" is classified as a true positive, because it is semantically near "corn" in the ontology.

Table 1. Example scores in the category "corn" (λ = 0.87)

Term       Score   Previous method   With ontology
Interest   0.92    TP                FP
Rate       0.89    TP                FP
Wheat      0.91    TP                TP

4 EVALUATION WITH MICRO-AVERAGING

Micro-averaging has been used for the evaluation of the proposed method:

Precision = ( Σ_{i=1}^{|c|} TP_i ) / ( Σ_{i=1}^{|c|} (TP_i + FP_i) )

Recall = ( Σ_{i=1}^{|c|} TP_i ) / ( Σ_{i=1}^{|c|} (TP_i + FN_i) )    (4)

TP_i: the number of words classified as true positive in category C_i.
FN_i: the number of words classified as false negative in category C_i.
FP_i: the number of words classified as false positive in category C_i.
|c|: the number of categories.

4.1 Implementation

To implement our method, we used SQL Server 2008 and Delphi version 7. We used the 10-category dataset "Reuters-21578 Apte Split", which includes 9603 reports in the training group and 3999 reports in the testing group, to train and test our method.

Table 2. The 10-category dataset "Reuters-21578"

Category Names   Number of Testing Samples   Number of Training Samples
Earn             1087                        2877
Acq              719                         1650
Money-fx         179                         538
Grain            149                         423
Crude            189                         389
Trade            118                         369
Interest         131                         347
Wheat            71                          212
Ship             89                          197
corn             56                          182
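The matrix pipeline of Section 2 (steps 6 to 12) can be sketched as follows. This is an illustrative reconstruction under stated assumptions: the toy corpus and vocabulary are invented, the document similarity matrix is taken as S = AᵀA (inner products of document columns), a plain sum normalization stands in for the paper's vector average weight method, and the |c| and cf_i values are placeholders.

```python
import math

# Toy corpus (hypothetical; stands in for one category layer of documents).
docs = [["wheat", "corn", "harvest"],
        ["corn", "price", "harvest"],
        ["ship", "port", "corn"]]
vocab = sorted({w for d in docs for w in d})
K, N = len(vocab), len(docs)

# Step 6: K x N binary term-document matrix A.
A = [[1 if w in d else 0 for d in docs] for w in vocab]

# Step 7: N x N document similarity matrix S (assumed: inner products of
# the document columns of A).
S = [[sum(A[k][i] * A[k][j] for k in range(K)) for j in range(N)]
     for i in range(N)]

# Step 8: K x N word-document relation matrix R = A · S.
R = [[sum(A[k][i] * S[i][j] for i in range(N)) for j in range(N)]
     for k in range(K)]

# Step 9: Vc = R · I (row sums against the all-ones vector I).
Vc = [sum(row) for row in R]

# Step 10: normalize Vc (plain sum normalization assumed).
total = sum(Vc)
vc = [v / total for v in Vc]

# Step 11: W_Ci = v_ci * log2(|c| / cf_i); |c| and cf_i are assumed values.
num_categories = 10
cf = {w: 2 for w in vocab}
W = [vc[k] * math.log2(num_categories / cf[vocab[k]]) for k in range(K)]

# Step 12: Score(c, d_new) = d_new · W_C for a binary test-document vector.
d_new = [1 if w in {"corn", "harvest"} else 0 for w in vocab]
score = sum(x * w for x, w in zip(d_new, W))
print(round(score, 4))
```

The sketch is deliberately dependency-free; with NumPy the same steps reduce to a few matrix products.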

4.2 Comparison of Methods

The comparison results are presented in Table 3. The methods listed there were applied to the same dataset that we used, so the average of the proposed method can be compared with the others.

Table 3. Breakeven Efficiency for the Reuters-21578 Apte Split, 10 Categories

Category   Find Similar   Naïve Bayes   Bayes Nets   Decision Trees   Linear SVM   Multilabel Method (Chang & Chen, 2006)   Proposed Method (with ontology)
Earn       92.9%          95.9%         95.8%        97.8%            98.0%        97.5%        60%
Acq        64.7%          87.8%         88.3%        89.7%            93.6%        95.1%        100%
Money-fx   46.7%          56.6%         58.8%        66.2%            74.5%        79.2%        100%
Grain      67.5%          78.8%         81.4%        85.0%            94.6%        84.7%        100%
Crude      70.1%          79.5%         79.6%        85.0%            88.9%        84.4%        100%
Trade      65.1%          63.9%         69.0%        72.5%            75.9%        85%          100%
Interest   63.4%          64.9%         71.3%        67.1%            77.7%        81%          100%
Ship       49.2%          85.4%         84.4%        74.2%            85.6%        85.4%        93.64%
Wheat      68.9%          69.7%         82.7%        92.5%            91.8%        79.8%        93.64%
corn       48.2%          65.3%         76.4%        91.8%            90.3%        78.2%        96.36%
Average    64.6%          81.5%         85%          88.4%            92.0%        91.3%        94.5%

[Table 4. Average for the 10 Categories of Reuters at Various Thresholds (λ = 0.1 to 0.9); see the Average row of Table 5.]
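The micro-averaged measures of equation (4) pool the per-category counts before dividing, rather than averaging per-category ratios. A minimal sketch (the per-category counts below are hypothetical):

```python
# Micro-averaged precision and recall, equation (4): sum TP_i, FP_i, FN_i
# over all categories C_i, then divide once.
def micro_precision_recall(counts):
    """counts: list of (TP_i, FP_i, FN_i) tuples, one per category."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts for three categories.
p, r = micro_precision_recall([(10, 0, 0), (8, 2, 2), (6, 0, 4)])
print(round(p, 3), round(r, 3))  # → 0.923 0.8
```

Micro-averaging weights every classified word equally, so large categories dominate the result; this matches the evaluation used in Section 4.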

4.3 Setting the Threshold Limit

There are various policies and strategies for determining the threshold λ, depending on the restrictions of the application. The most important distinction is whether the threshold limit is obtained analytically or experimentally:
• Determining the threshold analytically is called CSV or S-cut.
• Determining the threshold experimentally is called P-cut.
• Adjusting the threshold so that the Precision and Recall values reach an appropriate balance is called R-cut.
The threshold limit used in the proposed method is based on R-cut [Yang, 1999].
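A minimal sketch of such a threshold policy: among candidate λ values, pick the one whose Precision and Recall are best balanced, here by maximizing their average, which is the quantity reported in the Average row of Table 5. The per-threshold (precision, recall) pairs below are hypothetical illustrations, not the paper's measurements.

```python
# Pick the threshold λ that maximizes the average of precision and recall.
def pick_threshold(stats):
    """stats: dict mapping λ -> (precision, recall)."""
    return max(stats, key=lambda t: (stats[t][0] + stats[t][1]) / 2)

# Hypothetical (precision, recall) pairs per candidate threshold.
stats = {0.1: (0.99, 1.0), 0.3: (0.89, 1.0), 0.5: (1.0, 0.65), 0.9: (1.0, 0.16)}
best = pick_threshold(stats)
print(best)  # → 0.1
```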


The following table shows, for the 10 categories of Reuters at various thresholds, the Recall and Precision obtained from the numbers of words classified as true positive (TP_i) and false negative (FN_i) in each category C_i, together with their average.

Table 5. Recall, Precision, and Average for the 10 Categories of Reuters at Various Thresholds

Threshold   0.1     0.2    0.3     0.4     0.5     0.6     0.7    0.8    0.87   0.9
Recall      1       1      1       0.77    0.65    0.53    0.36   0.26   0.20   0.16
Precision   0.99    0.94   0.89    1       1       1       1      1      1      1
Average     99.5%   97%    94.5%   88.5%   82.5%   76.5%   68%    63%    60%    58%

5 CONCLUSION

In this work, micro-averaging, i.e. Recall and Precision, is used for the evaluation of the proposed method. If the ranks of several words turn positive for category C, then, based on the ontology used, the word that is semantically nearest to category C is classified as positive and considered correct, while the remaining positively classified words are considered incorrect with regard to category C. At this stage the significance of the ontology becomes clear: it causes words to be classified into the correct categories with greater precision. From the laboratory results shown in Table 3, we can observe that the presented method has higher precision and accuracy than the Decision Tree, Bayes Nets, Naïve Bayes, Find Similar, and SVM methods.
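The ontology correction summarized above can be sketched as follows; the semantic-distance table and word lists are hypothetical stand-ins for a real ontology.

```python
# Among words whose score turned positive for a category, keep only the word
# that is semantically nearest to the category according to the ontology;
# the others are treated as false positives.
def filter_with_ontology(positive_words, category, distance):
    """distance: dict mapping (word, category) -> semantic distance."""
    nearest = min(positive_words, key=lambda w: distance[(w, category)])
    return {w: (w == nearest) for w in positive_words}

# Hypothetical distances mirroring the example of Section 3.
distance = {("wheat", "corn"): 1, ("interest", "corn"): 5, ("rate", "corn"): 6}
result = filter_with_ontology(["wheat", "interest", "rate"], "corn", distance)
print(result)  # only "wheat" remains a true positive
```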

Advantages of Our Method
• From the laboratory results shown in Table 3, we can observe that the presented method has higher precision and accuracy than the Decision Tree, Bayes Nets, Naïve Bayes, Find Similar, and SVM methods (indirect comparison).
• With the ontology, the method operates fairly, because the number of incorrectly positive-classified examples drops to zero and, consequently, the precision of the category becomes one.

Disadvantages of Our Method
It is necessary to mention that although our proposed method achieves a higher average than SVM, it is slower. Improving the functionality of the ontology can address this problem.

6 FUTURE RESEARCH
• Work on improving the speed of our method.
• Work on the ontology, and keep the ontology updated for better results.
• The ontology can also be used in Term Space Reduction (TSR), which is expected to yield better results.

7 REFERENCES
1. Yang, Y., Pedersen, J.: A Comparative Study on Feature Selection in Text Categorization. In: Proceedings of the 14th International Conference on Machine Learning, Nashville, USA, pp. 412-420 (1997).
2. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1): 1-47, March (2002).
3. Chang, Y.-C., Chen, S.-M., Liau, C.-J.: A New Inductive Learning Method for Multilabel Text Categorization. IEA/AIE 2006, LNAI 4031, pp. 1249-1258 (2006).
4. Sever, H., Gorur, A., Tolun, M.R.: Text Categorization with ILA. In: Yazici, A., Sener, C. (eds.): ISCIS, pp. 300-307 (2003).
