"Fine-tuning an algorithm for semantic document clustering using a simi" by Lubomir Stanchev

Computer Science and Software Engineering

Title

Fine-tuning an algorithm for semantic document clustering using a similarity graph

Author Info

Lubomir Stanchev, California Polytechnic State University - San Luis Obispo

Recommended Citation

Postprint version. Published in International Journal of Semantic Computing, Volume 10, Issue 4, December 1, 2016, pages 527-555.

The definitive version is available at https://doi.org/10.1142/S1793351X16400195.

Abstract

In this article, we examine an algorithm for document clustering using a similarity graph. The graph stores words and common phrases from the English language as nodes, and it can be used to compute the degree of semantic similarity between any two phrases. One application of the similarity graph is semantic document clustering, that is, grouping documents based on the meaning of the words in them. Since our algorithm for semantic document clustering relies on multiple parameters, we examine how fine-tuning these values affects the quality of the result. Specifically, we use the Reuters-21578 benchmark, which contains 11,362 newswire stories that are grouped into 82 categories using human judgment. We apply the k-means clustering algorithm to group the documents using two similarity metrics: one based on keyword matching and one that uses the similarity graph. We evaluate the results of the clustering algorithms using multiple metrics, such as precision, recall, f-score, entropy, and purity.
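To make two of the evaluation metrics named in the abstract concrete, the following is a minimal sketch of purity and entropy under their common textbook definitions. It is an illustration only: the abstract does not give the article's exact formulas, so the definitions, the function names (`purity`, `entropy`, `group_by_cluster`), and the toy data are all assumptions and are not taken from the article or from Reuters-21578.

```python
import math
from collections import Counter


def group_by_cluster(clusters, labels):
    """Map each cluster id to a Counter of the human-assigned categories in it."""
    grouped = {}
    for cluster_id, label in zip(clusters, labels):
        grouped.setdefault(cluster_id, Counter())[label] += 1
    return grouped


def purity(clusters, labels):
    """Fraction of documents that belong to the majority category of their cluster."""
    total = len(labels)
    majority_sum = 0
    for members in group_by_cluster(clusters, labels).values():
        majority_sum += max(members.values())
    return majority_sum / total


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster category entropy (log base 2)."""
    total = len(labels)
    weighted = 0.0
    for members in group_by_cluster(clusters, labels).values():
        size = sum(members.values())
        h = -sum((c / size) * math.log2(c / size) for c in members.values())
        weighted += (size / total) * h
    return weighted


# Hypothetical toy data: six documents assigned to two clusters, with
# made-up category names standing in for human-assigned categories.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["grain", "grain", "trade", "trade", "trade", "trade"]

print(purity(clusters, labels))   # higher is better, at most 1.0
print(entropy(clusters, labels))  # lower is better, 0.0 for pure clusters
```

Under these assumed definitions, a clustering in which every cluster contains documents from a single category has purity 1 and entropy 0, so the two metrics move in opposite directions as cluster quality improves.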

Disciplines

Computer Sciences

URL: https://digitalcommons.calpoly.edu/csse_fac/253
