College - Author 1
College of Engineering
Department - Author 1
Computer Science Department
College - Author 2
College of Engineering
Department - Author 2
Computer Science Department
Advisor
Lubomir Stanchev, College of Engineering, Computer Science Department
Funding Source
The Noyce School of Applied Computing
Date
10-2023
Abstract/Summary
Document classification is a pivotal task in many domains, warranting the development of robust algorithms. Among these, the Bidirectional Encoder Representations from Transformers (BERT) model, introduced by Google, has proven to perform well when fine-tuned for the task at hand. Leveraging the transformer architecture, BERT demonstrates strong language-understanding capabilities. However, integrating BERT with a range of complementary techniques has shown potential to further improve classification accuracy. This work investigates several techniques that leverage semantic understanding to improve the performance of document classification models trained with BERT. Specifically, we explore three methods. First, we balance corpora afflicted by imbalanced training-data distributions. Next, while balancing, we substitute particular words with semantically similar words to create “synthetic documents,” thereby shifting the model's focus from individual words to semantic meaning. Finally, we retrain the model on a dataset composed of “synthetic documents,” assigning heavier weights to classes whose documents were commonly misclassified during the initial round of training. These approaches emphasize the significance of semantic comprehension in document classification, as the meaning of words and phrases often depends on contextual cues. Our findings demonstrate the efficacy of these techniques and highlight the importance of incorporating semantic knowledge in document classification algorithms. On our labeled testing set of news articles from BBC News, BERT achieves a baseline macro F1 score of 0.902; applying all of our techniques raises the macro F1 score to 0.951.
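The sketch below illustrates the “synthetic document” idea described in the abstract: replacing some words with semantically similar ones so that a balanced class can be padded with paraphrased copies rather than exact duplicates. It assumes WordNet (via NLTK) as the source of similar words and an illustrative swap probability; the abstract does not specify which lexical resource, similarity measure, or substitution rate the authors used.

# Minimal sketch, assuming WordNet via NLTK as the synonym source.
# Setup: pip install nltk && python -m nltk.downloader wordnet
import random

from nltk.corpus import wordnet


def synthesize_document(text, swap_prob=0.2, seed=0):
    """Replace a fraction of words with WordNet synonyms to form a new document.

    `swap_prob` is an illustrative parameter, not a value from the paper.
    """
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        # Collect synonyms from all WordNet senses, excluding the token itself.
        synonyms = sorted({
            lemma.name().replace("_", " ")
            for syn in wordnet.synsets(tok.lower())
            for lemma in syn.lemmas()
            if lemma.name().lower() != tok.lower()
        })
        if synonyms and rng.random() < swap_prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(tok)
    return " ".join(out)


print(synthesize_document("The government announced new economic policies today."))

For the reweighted retraining step, one standard realization (again an assumption, not the authors' stated implementation) is to pass per-class weights into the training loss, e.g. torch.nn.CrossEntropyLoss(weight=class_weights) in PyTorch, with larger weights on classes whose documents were frequently misclassified in the first round.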
URL: https://digitalcommons.calpoly.edu/ceng_surp/1