"Semantic Document Clustering Using a Similarity Graph" by Lubomir Stanchev

Computer Science and Software Engineering

Title

Semantic Document Clustering Using a Similarity Graph

Author Info

Lubomir Stanchev, California Polytechnic State University, San Luis Obispo

Recommended Citation

Postprint version. Published in 10th IEEE International Conference on Semantic Computing Proceedings: Laguna Hills, CA, February 1, 2016, pages 1-8.

The definitive version is available at https://doi.org/10.1109/ICSC.2016.8.

Abstract

Document clustering addresses the problem of identifying groups of similar documents without human supervision. Unlike most existing solutions, which cluster documents by keyword matching, we propose an algorithm that considers the meaning of the terms in the documents. For example, a document that contains the words "dog" and "cat" multiple times may be placed in the same category as a document that contains the word "pet", even if the two documents share only noise words. Our semantic clustering algorithm is based on a similarity graph, extracted from WordNet, that stores the degree of semantic relationship between terms, where a term can be a word or a phrase. We experimentally validate our algorithm on the Reuters-21578 benchmark, which contains 11,362 newswire stories grouped into 82 categories by human judgment. We apply the k-means clustering algorithm to group the documents twice: once with a similarity metric based on keyword matching and once with a metric that uses the similarity graph. We show that the second approach produces higher precision and recall, which means that its clusters more closely match the categories assigned by human judgment.
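
The "dog"/"cat" versus "pet" example in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the weights in `SIMILARITY` are made-up placeholders standing in for a WordNet-derived similarity graph, the noise-word list is hypothetical, and the semantic metric shown is a generic soft cosine chosen only to contrast with exact keyword matching.

```python
import math
from collections import Counter

# Hypothetical term-to-term weights (illustrative only; the paper derives
# its similarity graph from WordNet, and these numbers are invented).
SIMILARITY = {
    ("dog", "pet"): 0.6,
    ("cat", "pet"): 0.6,
}

# Hypothetical noise-word list, assumed for this example.
NOISE_WORDS = {"the", "a", "is", "my", "and"}


def term_vector(text):
    """Count the non-noise words of a lowercased, whitespace-split document."""
    return Counter(w for w in text.lower().split() if w not in NOISE_WORDS)


def related(t1, t2):
    """Symmetric lookup: 1.0 for identical terms, else the table weight or 0.0."""
    if t1 == t2:
        return 1.0
    return SIMILARITY.get((t1, t2), SIMILARITY.get((t2, t1), 0.0))


def keyword_similarity(d1, d2):
    """Cosine similarity that only rewards exact keyword matches."""
    v1, v2 = term_vector(d1), term_vector(d2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0


def semantic_similarity(d1, d2):
    """Soft cosine: each pair of terms contributes in proportion to its weight."""
    v1, v2 = term_vector(d1), term_vector(d2)

    def soft_dot(a, b):
        return sum(a[s] * b[t] * related(s, t) for s in a for t in b)

    norm = math.sqrt(soft_dot(v1, v1)) * math.sqrt(soft_dot(v2, v2))
    return soft_dot(v1, v2) / norm if norm else 0.0


doc_a = "the dog and the cat dog cat"
doc_b = "my pet is a pet"
print(keyword_similarity(doc_a, doc_b))   # no shared keywords after noise removal
print(semantic_similarity(doc_a, doc_b))  # related terms yield a positive score
```

Under these assumed weights the keyword metric sees the two documents as unrelated, while the semantic metric scores them as similar. Either function could serve as the similarity measure handed to a clustering algorithm, which is the kind of comparison the abstract describes.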

Disciplines

Computer Sciences

Copyright

2016 IEEE.

Publisher statement

Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Included in

Computer Sciences Commons

URL: https://digitalcommons.calpoly.edu/csse_fac/263
