Recommended Citation
IEEE 18th International Conference on Semantic Computing (ICSC), 2024, pages 1-8.
The definitive version is available at https://doi.org/10.1109/ICSC59802.2024.00008.
Abstract
Document classification is a pivotal task in various domains, warranting the development of robust algorithms. Among these, the Bidirectional Encoder Representations from Transformers (BERT) model, introduced by Google, has proven to perform well when fine-tuned for the task at hand. Leveraging the transformer architecture, BERT demonstrates strong language-understanding capabilities. However, integrating BERT with complementary techniques has shown potential to further enhance classification accuracy. This paper investigates three techniques that leverage semantic understanding to improve the performance of document classification models trained with BERT. First, we balance corpora afflicted by imbalanced training data distributions. Next, we substitute words and phrases with semantically similar terms to create “synthetic documents,” thereby shifting the focus from individual words to semantic meaning. Finally, we retrain our model on a dataset composed of “synthetic documents,” giving heavier weights to classes whose documents were commonly misclassified during the initial round of training. These approaches emphasize the significance of semantic comprehension in document classification, as the meaning of words and phrases often relies on contextual cues. Our findings demonstrate the efficacy of these techniques and highlight the importance of incorporating semantic knowledge in document classification algorithms. On our labeled testing set of news articles from BBC News, BERT achieves a baseline macro F1 score of 0.902. Incorporating the three techniques improves the macro F1 score to 0.951.
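The cover page itself includes no code, so the Python sketch below is only one plausible illustration of the second and third techniques summarized in the abstract, not the authors' implementation. It builds “synthetic documents” by swapping words for WordNet synonyms (the paper's “semantically similar terms” could instead come from embeddings), and constructs a class-weighted loss for the retraining round. The function name, the replace_prob parameter, and the per-class error rates are hypothetical placeholders; the sketch assumes nltk (with the wordnet corpus downloaded) and PyTorch.

# Illustrative sketch only, not the paper's code.
import random

import torch
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def make_synthetic_document(text: str, replace_prob: float = 0.3) -> str:
    """Replace a fraction of words with a WordNet synonym, shifting the
    emphasis from individual surface forms toward semantic meaning."""
    out = []
    for word in text.split():
        synsets = wn.synsets(word)
        if synsets and random.random() < replace_prob:
            # Take lemmas of the first synset and drop the word itself.
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
            candidates = [l for l in lemmas if l.lower() != word.lower()]
            out.append(random.choice(candidates) if candidates else word)
        else:
            out.append(word)
    return " ".join(out)


# Class-weighted loss for the retraining round: classes whose documents were
# often misclassified in the first round receive proportionally larger weights.
# Placeholder error rates for five classes (BBC News has five categories);
# these values are not taken from the paper.
error_rate = torch.tensor([0.05, 0.20, 0.10, 0.02, 0.08])
class_weights = 1.0 + error_rate / error_rate.mean()
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

In a full pipeline, a weighted loss like loss_fn would stand in for the default loss during the second round of BERT fine-tuning, for example by overriding compute_loss in a Hugging Face Trainer subclass.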
Disciplines
Computer Sciences
Copyright
979-8-3503-8535-9/24/$31.00 ©2024 IEEE
Number of Pages
8
URL: https://digitalcommons.calpoly.edu/csse_fac/274