DSC-TEXTRANK

May 26, 2017 | Author: 旭峰 吴 | Category: Natural Language Processing, Machine Learning



Algorithm: Here I'm going to describe my new text summarization algorithm. It consists of two parts: 1. DSC and 2. TextRank.

Before we get into the main body of the algorithm, we need to do some pre-processing on the target Chinese document. Unlike English, Chinese has no explicit segmentation between words, while most extractive summarization algorithms, including the one this project proposes, regard sentences as vectors composed of words. The first step of the pre-processing is therefore word segmentation.

With the pre-processing done, we get into the first part of the algorithm: DSC. DSC is short for Dominant Set Clustering, a clustering algorithm proposed in 2007 by Massimiliano Pavan and Marcello Pelillo. The advantage of this algorithm is that it determines the number of clusters automatically and, what's more, it yields the probability of each vertex belonging to its cluster. The iterative equation for DSC is:

\[ x_i(t+1) = x_i(t)\,\frac{(A\,x(t))_i}{x(t)^{\top} A\, x(t)} \]

Here i is the index of the sentence, x is the support vector over the sentences, t is the iteration count, and A is the similarity matrix of all the sentences. Since we need the similarity matrix, we have to define similarity first. I chose cosine similarity, and accordingly the elements of each vertex (sentence) vector are word frequencies. For instance, if sentence A contains word1 zero times, word2 once and word3 zero times, and sentence B contains word1 once, word2 twice and word3 zero times, the cosine similarity between A and B is:

\[ \mathrm{cossim}(A,B) = \frac{(0,1,0)\cdot(1,2,0)}{\lvert(0,1,0)\rvert\,\lvert(1,2,0)\rvert} = \frac{2}{\sqrt{5}} \approx 0.894 \]

The steps to compute the clusters are as follows:
1. Apply the iterative equation until x converges.
2. Pick out all the vertexes (sentences) with x_i > 0 as a cluster, and delete them from the similarity matrix A.
3. Repeat steps 1 and 2 until there is no vertex left.

A good feature of this method is that it tells us the probability of each vertex belonging to its cluster, which matters because in text summarization we can assume that a sentence with a higher probability is more representative of its topic. The purpose of conducting DSC is to discover the topics of the article.
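As a concrete illustration, the following Python sketch implements the DSC step as described above (a minimal sketch, not the project's actual code). It assumes the sentences have already been segmented into words, e.g. with the jieba library (the write-up does not name a segmenter), and turned into word-frequency vectors as in the cosine-similarity example.

import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity; vectors is an (n_sentences, n_words) count array."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.maximum(norms, 1e-12)
    A = unit @ unit.T
    np.fill_diagonal(A, 0.0)  # dominant-set dynamics use a graph without self-loops
    return A

def dominant_set(A, n_iter=1000, tol=1e-8):
    """Replicator dynamics: x_i(t+1) = x_i(t) * (A x(t))_i / (x(t)^T A x(t))."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)  # start from the barycenter of the simplex
    for _ in range(n_iter):
        Ax = A @ x
        denom = x @ Ax
        if denom <= 0:  # degenerate subgraph (e.g. all-zero similarities)
            break
        x_new = x * Ax / denom
        converged = np.abs(x_new - x).sum() < tol
        x = x_new
        if converged:
            break
    return x

def dsc_clusters(A, support_tol=1e-6):
    """Peel off dominant sets until no vertex is left.
    Returns a list of (member_indices, membership_probabilities) pairs."""
    remaining = list(range(A.shape[0]))
    clusters = []
    while remaining:
        sub = A[np.ix_(remaining, remaining)]
        x = dominant_set(sub)
        picked = [i for i, xi in enumerate(x) if xi > support_tol]  # "x_i > 0" numerically
        members = [remaining[i] for i in picked]
        clusters.append((members, x[picked]))
        remaining = [r for i, r in enumerate(remaining) if i not in set(picked)]
    return clusters

For the two example sentences, cosine_similarity_matrix(np.array([[0, 1, 0], [1, 2, 0]])) reproduces the 0.894 off-diagonal similarity computed above.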

Inside each topic, i.e. each cluster, I designed the following scoring scheme. For each cluster i of size N:

\[ \mathit{inter\_score}_i = \frac{1}{N}\sum_{j=1}^{N}\sum_{k=j}^{N} \mathit{sim}(j,k)\cdot\log(N) \]

inter_score_i measures the degree of coherence of a cluster. Here N is the size of the cluster, and log(N) is a factor that prevents large clusters' lengths from having an excessive impact on the result.
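A short sketch of this scoring step, under my reading of the reconstructed formula above (the 1/N and log(N) factors sit outside the double sum, and sim is the same cosine similarity used for clustering):

import numpy as np

def inter_score(A, members):
    """Cluster coherence: (1/N) * sum_{j=1..N} sum_{k=j..N} sim(j,k) * log(N).
    A is the full similarity matrix, members the cluster's sentence indices."""
    N = len(members)
    if N < 2:
        return 0.0  # log(1) = 0, so singleton clusters would score zero anyway
    sub = A[np.ix_(members, members)]
    total = np.triu(sub).sum()  # upper triangle, k >= j; the k = j self-terms
                                # contribute nothing if A's diagonal was zeroed for DSC
    return total / N * np.log(N)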

\[ \mathit{order\_score}_i = \alpha\cdot\frac{1}{1+i} \]

order_score_i measures the importance of order: clusters found earlier receive higher order scores, because their vertexes are more alike and more dominant among all the vertexes. Here α is an
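For instance, assuming clusters are indexed from 1 in the order they are discovered (the write-up does not state the index base), the first three clusters receive order scores of α/2, α/3 and α/4: a steadily decaying bonus for the earlier, more dominant clusters.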

empirical coefficient. Then we get the final cluster score:

\[ \mathit{cluster\_score}_i = \mathit{inter\_score}_i + \mathit{order\_score}_i \]

So that they can be compared with the other scores, all the cluster scores are normalized to the range 0 to 1. This completes the first part of the algorithm. We now come to the second part: TextRank. The formula for TextRank is:

\[ WS(V_i) = (1-d) + d\sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}}\, WS(V_j) \]

Although TextRank was originally designed for directed graphs, it can also be applied to undirected graphs, where the similarity between two nodes becomes the edge weight. The original version of TextRank used TF-IDF as the similarity metric; here we use another similarity measure, BM25, which may give better quality than TF-IDF. Once implemented, TextRank is applied in two ways: global TextRank and inner-cluster TextRank. The global TextRank score of a sentence is obtained by applying TextRank to all the vertexes, while the inner-cluster TextRank score concerns only the vertexes within the same cluster. The inner-cluster TextRank score is computed as follows. For each sentence i in cluster k:

\[ \mathit{inner\_textrank\_score}_i = \mathit{inner\_base\_score}_i \cdot \mathit{order\_score}_k \]

Here inner_base_score_i is the TextRank score of sentence i inside its cluster, and order_score_k is the order score of cluster k as defined in the DSC part. Note that inner_base_score_i is also normalized to the range 0 to 1. We also compute the global TextRank score of each sentence i, giving outer_textrank_score_i. Finally, after DSC's clustering and TextRank's ranking, we can work out the final score of each sentence i in cluster k:

\[ \mathit{final\_score}_i = \mathit{outer\_textrank\_score}_i + \alpha\cdot\mathit{cluster\_probability}_i + \beta\cdot\mathit{inner\_textrank\_score}_i + \gamma\cdot\mathit{cluster\_score}_k \]

Here cluster_probability_i is the probability of sentence i belonging to its cluster, and α, β and γ are empirical coefficients, set to 1.5, 0.5 and 1 respectively in this project.
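To tie the pieces together, here is a minimal sketch of the final scoring stage, reusing the dsc_clusters output from the earlier sketch. One substitution to note: the write-up weights the TextRank graph with BM25 similarities, but to stay self-contained this sketch accepts any precomputed weight matrix W (the cosine matrix from before would do). The coefficient defaults follow the α = 1.5, β = 0.5, γ = 1 given above.

import numpy as np

def textrank(W, d=0.85, n_iter=100, tol=1e-6):
    """Weighted undirected TextRank: WS_i = (1 - d) + d * sum_j (w_ji / sum_k w_jk) * WS_j."""
    out_sums = W.sum(axis=1)
    P = np.divide(W, out_sums[:, None], out=np.zeros_like(W), where=out_sums[:, None] > 0)
    ws = np.ones(W.shape[0])
    for _ in range(n_iter):
        ws_new = (1 - d) + d * (P.T @ ws)
        done = np.abs(ws_new - ws).max() < tol
        ws = ws_new
        if done:
            break
    return ws

def normalize(v):
    """Min-max normalization to the range 0..1."""
    v = np.asarray(v, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def final_scores(W, clusters, cluster_scores, order_scores, a=1.5, b=0.5, g=1.0):
    """clusters: list of (member_indices, membership_probabilities) from dsc_clusters;
    cluster_scores and order_scores are indexed by cluster number k."""
    n = W.shape[0]
    outer = textrank(W)  # global score; the write-up does not say whether it is normalized
    prob = np.zeros(n)
    inner = np.zeros(n)
    cscore = np.zeros(n)
    for k, (members, probs) in enumerate(clusters):
        base = normalize(textrank(W[np.ix_(members, members)]))  # inner_base_score in 0..1
        for pos, i in enumerate(members):
            prob[i] = probs[pos]                    # cluster_probability_i
            inner[i] = base[pos] * order_scores[k]  # inner_textrank_score_i
            cscore[i] = cluster_scores[k]           # cluster_score of this sentence's cluster
    return outer + a * prob + b * inner + g * cscore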


