CONQUIRO: A cluster-based meta-search engine




Computers in Human Behavior 27 (2011) 1303–1309


Maria Vargas-Vera a,*, Yesica Castellanos b, Miltiadis D. Lytras c

a Computing Department, The Open University, Walton Hall, Milton Keynes, MK7 6AA, UK
b Department of Mathematics, Universidad Nacional Autonoma de Mexico (UNAM), Ciudad Universitaria, Mexico, DF 04510, Mexico
c University of Patras, Argolidos 40-42, 153-44 Gerakas Attikis, Greece

Article history: Available online 21 August 2010

Abstract

This paper presents CONQUIRO, a cluster-based information retrieval engine. The main task of CONQUIRO is to organize the documents relevant to a request or query into groups/clusters. Its main purpose is to help users manage information efficiently. CONQUIRO uses machine learning algorithms (clustering methods) as its underlying technology. It has been equipped with hierarchical and non-hierarchical clustering algorithms, both using Euclidean distance and cosine similarity as distance measures. The authors believe that CONQUIRO represents a solution to the problem of information management, since it goes beyond a simple ranked list of documents (as returned by Google). © 2010 Elsevier Ltd. All rights reserved.

1. Introduction

The rise in popularity of the Web has created a demand for services which help users find relevant information without having to look through a large collection of documents. One such service is a meta-search engine based on clustering techniques, which allows users to visualize relevant information quickly. Such a system should provide organized information which people can then use to discard irrelevant sets of documents more efficiently.

Let us imagine the scenario where a user poses a query such as "star". A keyword-based search engine such as Google will present the user with web pages about astronomy, music, plants, animals, technology, advertisements, movies, etc., in any order. In particular, for the query "star" Google found 27,000,000 documents. However, it is very likely that only a few of the offered documents are relevant to the query. Therefore, a better solution for short queries (where no background is provided) is to organize the documents into clusters; non-relevant clusters can then be skipped quickly. CONQUIRO1 was designed to help by retrieving documents and organizing them in a way that users find easy to browse.

In general, it is well known that clustering does not improve the performance of an information retrieval engine, because the running time of many clustering algorithms is quadratic in the number of documents. However, the current version of CONQUIRO also implements a suite of clustering algorithms which run in linear time, for example suffix tree clustering.

* Corresponding author. E-mail address: [email protected] (M. Vargas-Vera).
1 CONQUIRO was implemented using the Google API, Matlab and PERL.
doi:10.1016/j.chb.2010.07.025

Our main contributions in this paper are (1) CONQUIRO, a meta-search engine which organizes documents into topics given an ambiguous query, and (2) a labelling algorithm called "Common Term in the Cluster", described in Section 2.3.

The paper is organised as follows: Section 2 gives a brief background on clustering, feature selection and the creation of vectors from documents. Section 3 presents the CONQUIRO architecture. Section 4 presents a working example. Section 5 presents a preliminary evaluation. Section 6 reports on related work. Finally, Section 7 presents conclusions and future work.

2. Clustering

We used clustering techniques with the aim of giving the user the ability to browse through databases or web content. CONQUIRO was equipped with non-hierarchical and hierarchical algorithms. This decision was based on the fact that we wanted to explore/analyse which visualisation seems easier for users. From our observations we found that humans find taxonomies easier to understand than lists of items; therefore, we were inclined to offer only hierarchical methods. But for efficiency in the clustering process CONQUIRO offers non-hierarchical clustering methods as well.

As a reminder to the reader, non-hierarchical methods divide a collection into subsets. The most common approach tries to partition N objects into K groups/clusters; the resulting structure depends completely upon the choice of the K centroids. Non-hierarchical methods are attractive because of their low computational cost of O(N), where N is the number of documents. Hierarchical methods work in the inverse way: these algorithms start by placing each point in a different cluster, then similar clusters are joined recursively to form bigger clusters until only one cluster remains.



2.1. Feature selection

Feature extraction is one of the most important aspects of clustering, since a good selection of features improves the quality of the clusters. CONQUIRO has been equipped with methods for feature extraction, among them the Document Frequency threshold (DF). However, in Section 5 we only show experiments using the TF-IDF weighting schema.

2.1.1. Document frequency threshold

The Document Frequency threshold is the simplest technique for vocabulary reduction, and it scales easily to very large corpora. Document frequency is the number of documents in which a term occurs. We compute the document frequency for each unique term and remove from the feature space those terms with document frequency less than a given threshold. The removal of rare terms reduces the dimensionality of the feature space. We set the document frequency threshold to 1 and removed terms which appear in only one document, on the assumption that such terms are not very informative. A comparative study of feature selection in text categorization can be found in (Yang & Pedersen, 1997).

2.2. Creating vectors from documents

The algorithm can be summarized in three main steps: cleaning the documents, finding features and filling vectors using TF (term frequency), and normalizing the vectors. Cleaning documents involves several activities:

- removing HTML tags;
- converting upper-case letters to lower-case letters;
- removing punctuation symbols;
- removing stop words (the stop word list is language dependent);
- performing stemming using Porter's stemming algorithm, which is also language dependent (Porter, 1980).
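The document frequency thresholding step described above can be sketched in a few lines. This is our own minimal illustration (the function name and data are hypothetical; CONQUIRO itself was implemented in Matlab and PERL):

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only terms that appear in at least `min_df` documents.

    `docs` is a list of token lists; with min_df=2 we drop terms that
    occur in a single document, matching the threshold of 1 used above.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))          # count each term once per document
    vocab = {t for t, n in df.items() if n >= min_df}
    return [[t for t in tokens if t in vocab] for tokens in docs]

docs = [["star", "astronomy", "sky"],
        ["star", "movie", "actor"],
        ["quasar", "sky"]]
print(df_filter(docs))  # → [['star', 'sky'], ['star'], ['sky']]
```

Only "star" and "sky" occur in more than one document, so every other term is removed from the feature space.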

In our view, a single word describes the content of a cluster better. In fact, we believe that a label consisting of a single word facilitates the user's task when locating chunks of information. Currently, CONQUIRO offers three labelling methods, which are described below.

2.3.1. Methods for creating labels

CONQUIRO offers the following labelling methods: inverse document frequency, frequent and predictive words, and common term in the cluster. Each of them is explained below.

2.3.1.1. Inverse document frequency method. The inverse document frequency method uses as labels the terms which have a high value of Freq_k:



Freq_k = Cl_k × log(|C| / |C_k|)

where Cl_k is the sum of the number of occurrences of term k in cluster Cl, |C| is the total number of clusters and |C_k| is the number of clusters containing term k (Tonella, Ricca, Pianta, & Girardi, 2003).

2.3.1.2. Frequent and predictive words method. The frequent and predictive words method selects terms as labels based on the product of local frequency and predictiveness (Popescul & Ungar, 2000):

p(term|cluster) × p(term|cluster) / p(term)

where p(term|cluster) is the frequency of the term in a given cluster and p(term) is the term's frequency in the whole collection.

2.3.1.3. Common term in the cluster. The "Common Term in the Cluster" method selects as label a term which is contained in most of the documents of the cluster or which has a high frequency in the cluster. It is important to note that a term with high frequency in the cluster does not necessarily appear in most of the documents. This method was designed by us; at first sight it seems to perform well, since the produced labels correctly identify the generated clusters, but further evaluation of the method is underway. The algorithm is shown below.

Algorithm: Common term in the cluster

We represent each document vector d as follows:

d = (w_1, w_2, ..., w_n)

where w_i is the weight of the ith term of document d. In our study we also used TF-IDF (Term Frequency-Inverse Document Frequency) (Salton & Buckley, 1988):

w_i = tf_i × log(N / n_i)

where tf_i is the raw frequency of term i in document d, N is the total number of documents in the corpus and n_i is the number of documents in the corpus in which term i appears. Once the vectors are created, a clustering algorithm is applied to the document set (Berry, Drmat, & Jessup, 1999). CONQUIRO offers several clustering algorithms: k-means, agglomerative algorithms (single and complete linkage), bisecting k-means, UPGMA (Steinbach, 2000) and the suffix tree clustering algorithm (Zamir & Etzioni, 1998). Finally, the visualisation component of CONQUIRO presents clusters of documents to users in a tree structure.

2.3. Labelling the clusters

Cluster labelling is an essential part of our CONQUIRO meta-search engine. There is no consensus on whether a single word or several words describe a cluster better.
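The TF-IDF weighting scheme w_i = tf_i × log(N/n_i) can be sketched as follows. This is a minimal illustration of the formula in Python (function and variable names are ours, not CONQUIRO's):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build w_i = tf_i * log(N / n_i) vectors over the corpus vocabulary."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                 # n_i: documents containing term i
    vocab = sorted(df)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)                   # tf_i: raw frequency in document d
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors([["star", "sky", "star"], ["star", "movie"]])
# "star" appears in every document, so log(N/n_i) = log(2/2) = 0
# and its weight vanishes, as the formula intends for uninformative terms.
```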

Input: group C, hash of labels H
Output: label of group, cl
1:  cl = null
2:  for each term t of the centroid of group C whose weight is > 0
3:      compute the number of documents in the group which contain t
4:      obtain the frequency of t in the group
5:      add t and its frequency to a list L
6:  end
7:  if |C| > 2 and the documents of the group are not all equal to each other
8:      sort L by number of documents in descending order
9:  else
10:     sort L by frequency in descending order
11: end
12: for each term t in L, while cl == null, do
13:     if t is not in H then cl = t
14: end
15: if cl == null then cl = head(L)
16: add cl to H
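The pseudocode above can be transcribed roughly as follows. This is our own sketch: the centroid step is simplified to counting over the cluster's documents, and tie-breaking is left to sort stability:

```python
from collections import Counter

def common_term_label(cluster_docs, used_labels):
    """Label a cluster by the term found in most of its documents,
    falling back to in-cluster frequency for small or duplicate groups."""
    doc_count = Counter()   # number of documents containing each term
    freq = Counter()        # total frequency of each term in the cluster
    for tokens in cluster_docs:
        doc_count.update(set(tokens))
        freq.update(tokens)
    distinct = len({tuple(t) for t in cluster_docs}) > 1
    key = doc_count if (len(cluster_docs) > 2 and distinct) else freq
    ranked = sorted(key, key=key.get, reverse=True)
    # first candidate not already used as a label elsewhere, else the head
    label = next((t for t in ranked if t not in used_labels), ranked[0])
    used_labels.add(label)
    return label

used = set()
docs = [["star", "astronomy"], ["star", "sky"], ["star", "telescope"]]
print(common_term_label(docs, used))  # → star (present in all three documents)
```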


In short, the common term method selects as cluster label the term which can be found in most of the documents of the cluster, no matter whether the term also appears in documents of other clusters.

2.4. Remarks on hierarchical and non-hierarchical clustering

Hierarchical clustering produces hierarchies of clusters and therefore carries more information than non-hierarchical algorithms. A well known example of hierarchical clustering is UPGMA, the un-weighted pair group method with arithmetic mean (Sneath & Snokal, 1973). However, hierarchical clustering is less efficient with respect to time and space than non-hierarchical clustering. This is not its only drawback: we tested Hierarchical Agglomerative Clustering (HAC) in our visualization tool CONQUIRO and found that HAC with complete linkage produces a more compact hierarchy, as opposed to HAC with single linkage, which produces elongated trees (so-called chaining behavior). Furthermore, we added to HAC (complete linkage) an algorithm which performs dendrogram pruning to ensure that the depth of the generated tree is not very large.

According to Steinbach (2000), bisecting k-means is a good algorithm for clustering documents into a hierarchical structure; their experimental results indicate that the bisecting k-means technique performs as well as or better than HAC. An algorithm that affords more flexibility and, being linear, is also relatively quick is suffix tree clustering (Zamir & Etzioni, 1998); therefore, suffix tree clustering has been included in our CONQUIRO meta-search engine. It is based on the idea of creating clusters from documents that share common phrases. A full description of the suffix tree clustering method can be found in (Zamir & Etzioni, 1998).

3. Architecture

The proposed architecture of our system comprises four modules (Fig. 1): Interface, Document Processing, Document Clustering and Visualization. The notation used in Fig. 1 is as follows: processes are represented by ellipses and data by cylinders.
The Interface is a window menu interface in which an English query is given by the user. The query is then sent to a search engine; currently, the search engine used by CONQUIRO is Google.


Document Processing includes several processing components, such as a parser and a stemmer. The input to this module is the set of documents retrieved by Google.

Document Clustering comprises a library of clustering algorithms. Our first prototype contains two types of algorithms: hierarchical and non-hierarchical. The output of this module is a set of documents organized by topics (themes).

The Visualization module is a front-end which allows the user to focus on a specific cluster.

4. Working example

We will explain the process model described earlier by walking through a specific example. The question posed to our system is a single-word query. Fig. 2 shows a snapshot of a short query, "star"; this is a single-word query without context. Fig. 2 also shows a menu of "clustering parameters" and "labelling parameters" which can be selected by the user.

The clustering parameters are as follows:

- Search engine: only Google is offered in this first implementation of CONQUIRO.
- Clustering: hierarchical and non-hierarchical clustering algorithms.
- Use as documents: snippets or full text. Clustering algorithms using full text give better results but are very expensive in running time, so we decided to provide CONQUIRO with both options.
- Distance: Euclidean or cosine distance.
- Num clusters: the number of clusters for the k-means algorithm.
- Threshold: allows tuning of the clustering algorithm.
- Term weighting: TF or TF-IDF.

The labelling parameters are:

- Method: CONQUIRO offers three methods for labelling clusters (see Section 2.3.1).
- Number of terms in label: the number of terms per cluster label can be defined by the user.
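The two distance options offered by the interface can be written out as follows. This is a minimal sketch of the standard definitions (the helper names are ours):

```python
import math

def euclidean(u, v):
    """Straight-line distance between two document vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """1 - cosine similarity: 0 when the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

u, v = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
print(euclidean(u, v))        # ≈ 2.236: sensitive to document length
print(cosine_distance(u, v))  # ≈ 0.0: same direction, so same topic profile
```

The example shows why cosine similarity is popular for text: a document and a twice-as-long copy of it have zero cosine distance but non-zero Euclidean distance.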

Fig. 1. CONQUIRO architecture.



Fig. 2. Dialogue box where a query is posed: "star".

Fig. 3. CONQUIRO’s output after applying suffix tree clustering method.



Clusters generated by CONQUIRO are shown in Fig. 3. The right hand side shows the set of clusters found and the left hand side shows the list of documents associated with the selected group/cluster.

5. Experiments and evaluation

Experiments were carried out using the query "star", for which a set of 200 documents was retrieved from Google. The retrieved set of documents was cleaned using the procedure described in Section 2.2. Then feature selection was performed using term frequency (TF) and the TF-IDF weighting schema. In particular, for the query "star" 415 features were obtained after applying the dimensionality reduction methods. It is worth reminding the reader that our experiments were performed using the title and summaries (snippets) of the documents instead of the full text, although using the full text would improve the clustering results. The decision to use snippets plus title was taken because of the time needed to cluster documents: the clustering process takes 30 s using full text and 2 s using snippets (Crabtree, 2004). The summaries/snippets were obtained from the HTML code of each document.

As evaluation methodology we used the gold standard approach (widely used in the Information Retrieval field), since task oriented methods are more prone to the effects of subjective evaluation. Precision and recall measures were computed against a gold standard created by a computer scientist. It is worth mentioning that there is little in the way of gold standards in clustering, except in well-prescribed sub-domains where validity assessments are objective (Dubes, 1993). In our gold standard, the documents for the query "star" were distributed into 65 clusters. Some of these clusters are: News and media, Energy, Star Trek, Non-profit organizations, Astronomy, Real estate, Finance and investment, Employment, Cars, Airlines, Star Office, Music, Unknown, etc.
Experiments were carried out using k-means, hierarchical agglomerative clustering with single linkage (the nearest neighbor method) and with complete linkage (and its pruned variation), both using two similarity measures (Euclidean and cosine), average linkage UPGMA, bisecting k-means and suffix tree clustering. A comparative study is shown in Table 1, which gives an estimation of precision and recall for the clustering algorithms implemented in CONQUIRO. These experiments were carried out using cosine as the similarity measure and the TF-IDF weighting schema.

Currently, CONQUIRO does not use context knowledge such as synonyms, hyponyms and hypernyms. However, in future implementations we plan to include a thesaurus such as WordNet (Felbaum, 1998) to improve our text clustering results.

The performance of each of the methods embedded in CONQUIRO was computed by means of precision and recall measures. The following definitions of precision and recall were used in our evaluations. Precision is defined as the total number of documents from each cluster in its represented category divided by the total number of documents in the clusters.

Recall is defined as the total number of distinct documents from each cluster in its represented category divided by the total number of documents in the user-assigned categories. In short, precision measures the accuracy of the clusters whilst recall measures coverage of all documents provided in the corpus set. Further information about these precision and recall formulas can be found in (Crabtree, 2004).

The evaluation presented in Table 1 shows the following results for the query "star". Bisecting k-means performs as well as HAC with complete linkage without dendrogram pruning; the HAC algorithm is quadratic in time while bisecting k-means is linear, but HAC produces better quality clusters. The suffix tree clustering method performs as well as HAC with complete linkage and dendrogram pruning, with the advantage that its running time is linear. HAC with single linkage produced elongated trees, so the quality of its clusters was not adequate for our main purpose (to help users visualize information efficiently). The HAC methods do not allow multiple assignment of categories; multiple assignment can be obtained by using, for example, Formal Concept Analysis (FCA), but it is well known that FCA is an expensive method (as it needs to keep a complete lattice structure).

The performance of each of the clustering methods embedded in CONQUIRO is shown in Graph 1, which plots the precision and recall measures for each of the seven clustering algorithms. The notation used in the graphs presented in this paper is as follows: HAC (S) means hierarchical agglomerative clustering with single linkage, HAC (C) means complete linkage, HAC (C + P) means complete linkage with dendrogram pruning, and B_k-means means bisecting k-means.
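Under one reading of the precision and recall definitions above, where each cluster's "represented category" is taken to be its best-overlapping gold-standard category, the measures can be sketched as follows (the function name and the matching assumption are ours; Crabtree (2004) gives the exact formulas):

```python
def cluster_precision_recall(clusters, categories):
    """clusters/categories: lists of sets of document ids.

    Precision: documents landing in their cluster's best-matching
    category, over all clustered documents.  Recall: distinct documents
    covered, over all documents in the user-assigned categories.
    """
    correct = sum(max(len(c & cat) for cat in categories) for c in clusters)
    clustered = sum(len(c) for c in clusters)
    covered = len(set().union(*clusters) & set().union(*categories))
    total = len(set().union(*categories))
    return correct / clustered, covered / total

clusters = [{1, 2, 3}, {4, 5}]
categories = [{1, 2, 4}, {3, 5, 6}]
p, r = cluster_precision_recall(clusters, categories)
# p = 0.6 (3 of 5 clustered docs match), r ≈ 0.83 (5 of 6 docs covered)
```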
Finally, Graph 2 shows the performance for the query "jaguar". Precision and recall are higher than for the query "star" since the jaguar database contains fewer outliers: almost all jaguar documents are similar in topic (i.e., they fall into one of the 34 groups). The jaguar database was provided by Daniel Crabtree (Crabtree, 2004); in order to meet CONQUIRO's requirements, we added the title of each document to its snippet. We obtained 59% precision and 67% recall using the query jaguar and the method HAC (C + P) on random data. In a second experiment we achieved higher precision and recall (89% precision and 86% recall) using the query jaguar and the method HAC (C + P).

Table 1. Comparative table using a suite of clustering methods and snippets.

Method (cosine similarity)        Precision (%)    Recall (%)
k-Means                           56               56
HAC single linkage                51               97
HAC complete linkage              46               84
HAC complete linkage (pruning)    68               84
Average linkage UPGMA             49               86
Bisecting k-means                 43               86
Suffix tree                       68               45

Graph 1. Precision and recall measures for the query star.






Graph 2. Precision and recall measures for the query jaguar.

It is worth mentioning that we were unable to compare CONQUIRO's clustering methods against Crabtree's work because he implemented different clustering methods. However, we can see from his experiments that he obtained 43% precision and 25% recall using the maximal cluster covering extension on random data, and 86% precision and 41% recall on non-random data.

6. Related work

The task of clustering documents (organizing them in groups) is a difficult job even for humans. A study of human performance on clustering web pages is reported in (Macskassy, Banerjee, Davison, & Hirsh, 1998). Macskassy et al. claimed that the complexity of the problem lies in the fact that there is no unique way to cluster documents; in their own words, if subjects do not agree on clusters, there is no effective way to cluster documents automatically (Macskassy et al., 1998).

Cimiano, Hotho, and Staab (2004a, 2004b) present an analysis of clustering algorithms (conceptual, partitional and agglomerative) for learning taxonomies from text, and suggest a formal concept analysis algorithm for the automatic construction of ontologies. However, we believe that this algorithm is too slow for learning taxonomies from text, in particular when browsing large collections of documents: its worst-case complexity is exponential in the number of terms to be ordered. Cimiano et al. described an experiment in the tourist domain2 where the Formal Concept Analysis (FCA) algorithm was used, and claimed that the complexity of the FCA algorithm is almost linear for that specific corpus.

Scatter/Gather (Cutting, Karper, Pedersen, & Tukey, 1992) produces an initial set of clusters, each of which can be re-clustered on the fly to produce more specific clusters; this process can be repeated until only singleton documents remain in a cluster. The buckshot algorithm was used to improve time performance.
A proposal for Web document clustering can be found in (Zamir & Etzioni, 1998). A suffix tree clustering algorithm with linear time complexity is suggested; it has been used in conjunction with a meta-crawler3 at the University of Washington. Experiments show that their algorithm outperforms k-means, buckshot and hierarchical clustering algorithms.

A method for deriving a hierarchical organization of concepts from a set of documents is described in (Sanderson & Croft, 1999). Sanderson and Croft's approach does not use training data or clustering; they used salient words and phrases extracted from the documents. These words and phrases are organized hierarchically using a co-occurrence relation known as subsumption, and the resulting hierarchical structure is shown as a series of hierarchical menus.

2 http://www.lonelyplanet.com and http://www.all-in-all.de, sites containing information about accommodation and activities in Mecklenburg-Vorpommern, a region in the northeast of Germany.
3 http://www.cs.washington.edu/research/clustering

7. Conclusions and future work

In this paper we have presented an experimental study of several divisive and hierarchical clustering methods. The corpus we used consisted of semi-structured documents from the web; however, in this first implementation of CONQUIRO we did not use any special property of the semi-structured documents.

Our main contribution is to provide the means to make sense of retrieved information in an efficient manner. We believe that CONQUIRO can be used as an information management tool. In fact, the authors believe that by providing information organized into clusters, together with a summary of each cluster, users can discard irrelevant information more efficiently. We have therefore implemented an easy-to-use tool which allows users to visualise clusters of documents organized by topics. This idea is not completely new, but it offers beneficial aspects for information management.

An analysis of different clustering algorithms was performed to assess performance when retrieving large collections of documents; however, more experiments need to be carried out in order to assess usability aspects. Preliminary results have shown that hierarchical agglomerative clustering (complete linkage) used with cosine similarity and dendrogram pruning produces a more compact tree. This is one of the desired properties, since the user can find relevant information by traversing only one or two levels of the generated tree. A disadvantage of the agglomerative complete linkage algorithm is that it runs in quadratic time; however, we have also equipped CONQUIRO with a suite of methods which run in linear time (e.g., the suffix tree algorithm). Another contribution is the inclusion in CONQUIRO of our own labelling method, "Common Term in the Cluster"; further evaluation of this method will be carried out as future work.
CONQUIRO has been implemented on Windows, and its clustering algorithms were implemented using Matlab4 and PERL scripts. The Google API was used to retrieve documents from the web. There is clearly a lot more work needed to make this technology work well enough for large-scale deployment. Further work may include the possibility of re-organizing documents already grouped in a cluster, or of removing irrelevant clusters. Further work also includes analyzing the labelling algorithms to assess which one offers better labelling performance; currently, CONQUIRO offers the three methods described earlier in the paper (Section 2.3.1). Finally, we want to optimize the time required to cluster large data sets; as a future step we therefore plan to migrate CONQUIRO to a distributed architecture, for instance a multi-agent architecture.

4 k-Means clustering software can be downloaded from: http://people.revoledu.com/kardi/tutorial/kMean/matlab_kMeans.htm

References

Berry, M. W., Drmat, Z., & Jessup, E. R. (1999). Matrices, vector spaces and information retrieval. SIAM Review, 41(2), 335–362.
Cimiano, P., Hotho, A., & Staab, S. (2004a). Comparing conceptual, divisive and agglomerative clustering for learning taxonomies from text. In Proceedings of the European conference on artificial intelligence (ECAI04).
Cimiano, P., Hotho, A., & Staab, S. (2004b). Clustering concept hierarchies from text. In Proceedings of the conference on lexical resources and evaluation (LREC), May.
Crabtree, D. (2004). Improvements to web page clustering methods. Master's thesis, Victoria University of Wellington, New Zealand.
Cutting, D. R., Karper, D. R., Pedersen, J. O., & Tukey, J. W. (1992). Scatter/gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th annual international ACM/SIGIR conference, Copenhagen.
Dubes, R. C. (1993). Cluster analysis and related issues. In C. H. Chen, L. F. Pau, & P. S. Wang (Eds.), Handbook of pattern recognition and computer vision (pp. 3–32). River Edge, NJ: World Scientific Publishing Co.
Felbaum, C. (Ed.) (1998). WordNet: An electronic lexical database. The MIT Press.
Macskassy, S. A., Banerjee, A., Davison, B. D., & Hirsh, H. (1998). Human performance of clustering web pages. Knowledge Discovery and Data Mining, 4, 264–268.
Popescul, A., & Ungar, L. H. (2000). Automatic labeling of document clusters. Unpublished paper.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–137.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), 513–523.
Sanderson, M., & Croft, B. (1999). Deriving concept hierarchies from text. In Proceedings of the 22nd annual international ACM SIGIR conference on research and development in information retrieval, Berkeley, California, US, pp. 206–213.
Sneath, P. H., & Snokal, R. R. (1973). Numerical taxonomy. San Francisco, USA: W. H. Freeman and Company.
Steinbach, M. (2000). A comparison of document clustering techniques. In KDD workshop on text mining.
Tonella, P., Ricca, F., Pianta, E., & Girardi, C. (2003). Using keyword extraction for web site clustering. In Proceedings of WSE, 5th international workshop on web site evolution, pp. 41–48.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Proceedings of ICML-97, 14th international conference on machine learning, pp. 412–420.
Zamir, O., & Etzioni, O. (1998). Web document clustering: A feasibility demonstration. In Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval, Melbourne, Australia, pp. 46–54.
