College of Engineering Summer Undergraduate Research Program

Improving Semantic Document Classification Accuracy by Integrating Human-Crafted Knowledge

Zachary Weinfeld, California Polytechnic State University, San Luis Obispo
Lubomir Stanchev, California Polytechnic State University, San Luis Obispo

College - Author 1

College of Engineering

Department - Author 1

Computer Science Department

College - Author 2

College of Engineering

Department - Author 2

Computer Science Department

Advisor

Lubomir Stanchev, College of Engineering, Computer Science Department

Funding Source

The Noyce School of Applied Computing

Date

10-2023

Abstract/Summary

Document classification is a pivotal task in many domains, which motivates the development of robust algorithms. Among these, Bidirectional Encoder Representations from Transformers (BERT), a language model introduced by Google, has proven to perform well when fine-tuned for the task at hand. Built on the transformer architecture, BERT demonstrates strong language understanding capabilities. However, combining BERT with additional techniques has shown potential for further improving classification accuracy. This work investigates several techniques that leverage semantic understanding to improve the performance of document classification models trained with BERT. Specifically, we explore three methods. First, we balance corpora that suffer from imbalanced training data distributions. Next, when balancing, we substitute particular words with semantically similar words to create “synthetic documents,” thereby shifting the model's focus from individual words to semantic meaning. Finally, we retrain the model on a dataset composed of “synthetic documents,” giving heavier weights to classes whose documents were commonly misclassified during the initial round of training. These approaches emphasize the significance of semantic comprehension in document classification, as the meaning of words and phrases often relies on contextual cues. Our findings demonstrate the efficacy of these techniques and highlight the importance of incorporating semantic knowledge in document classification algorithms. On our labeled testing set of news articles from BBC News, BERT achieves a baseline macro F1 score of 0.902. Using all of our techniques, we improve the macro F1 score to 0.951.
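
The balancing and word-substitution steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the synonym table, the function names (`make_synthetic`, `balance`, `macro_f1`), and the `swap_prob` parameter are all assumptions made for illustration, and the abstract does not state which knowledge source or sampling scheme was actually used. The sketch oversamples minority classes with synonym-swapped copies of existing documents and includes a plain macro F1 computation (the unweighted mean of per-class F1 scores), which is the metric reported above.

```python
import random
from collections import Counter

# Hypothetical, hand-made synonym table standing in for a human-crafted
# knowledge source. The abstract does not say which resource was used.
SYNONYMS = {
    "film": ["movie"],
    "win": ["victory"],
    "firm": ["company"],
}


def make_synthetic(text, rng, swap_prob=0.5):
    """Return a copy of text with some words swapped for listed synonyms."""
    words = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < swap_prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


def balance(docs, labels, seed=0):
    """Oversample minority classes with synthetic documents until every
    class has as many examples as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    # Group the original documents by class so we can sample from them.
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)

    out_docs, out_labels = list(docs), list(labels)
    for label, originals in by_class.items():
        for _ in range(target - counts[label]):
            out_docs.append(make_synthetic(rng.choice(originals), rng))
            out_labels.append(label)
    return out_docs, out_labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    docs = ["the film was a win", "the firm grew",
            "the firm shrank", "the firm merged"]
    labels = ["entertainment", "business", "business", "business"]
    new_docs, new_labels = balance(docs, labels)
    print(Counter(new_labels))
    print(new_docs[len(docs):])
```

In a full pipeline, the balanced documents would be passed to a BERT fine-tuning step, and the third method (heavier weights for commonly misclassified classes) could be approximated by per-class loss weights derived from the first round's errors. That part is omitted here because the abstract does not specify the weighting scheme.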

October 1, 2023.

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons

URL: https://digitalcommons.calpoly.edu/ceng_surp/1
