College - Author 1: College of Engineering

Department - Author 1: Computer Science Department

College - Author 2: College of Engineering

Department - Author 2: Computer Science Department


Lubomir Stanchev, College of Engineering, Computer Science Department

Funding Source: The Noyce School of Applied Computing




Document classification is a pivotal task in many domains, warranting the development of robust algorithms. Among these, the Bidirectional Encoder Representations from Transformers (BERT) model, introduced by Google, has proven to perform well when fine-tuned for the task at hand. Leveraging the transformer architecture, BERT demonstrates strong language understanding capabilities. However, integrating BERT with complementary techniques has shown potential for further improving classification accuracy. This work investigates several techniques that leverage semantic understanding to improve the performance of document classification models trained with BERT. Specifically, we explore three methods. First, we balance corpora that suffer from skewed class distributions in the training data. Second, while balancing, we substitute particular words with semantically similar words to create “synthetic documents,” thereby shifting the model's focus from individual words to semantic meaning. Finally, we retrain the model on a dataset composed of these synthetic documents, assigning heavier weights to classes whose documents were commonly misclassified during the initial round of training. These approaches underscore the significance of semantic comprehension in document classification, since the meaning of words and phrases often depends on contextual cues. Our findings demonstrate the efficacy of these techniques and highlight the importance of incorporating semantic knowledge into document classification algorithms. On our labeled testing set of news articles from BBC News, BERT achieves a baseline macro F1 score of 0.902; applying all of our techniques improves the macro F1 score to 0.951.
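As a rough illustration of the techniques described above, the sketch below shows how synonym substitution could generate “synthetic documents” to oversample minority classes, and how heavier loss weights could be assigned to frequently misclassified classes. The synonym table, the swap probability, and the weighting scheme (base weight plus per-class error rate) are all illustrative assumptions, not the exact procedure used in this work; in practice the synonyms would come from a semantic resource such as WordNet or embedding-space neighbors.

```python
import random

# Hypothetical synonym table; a real system would draw semantically
# similar words from WordNet or from nearest neighbors in an embedding space.
SYNONYMS = {
    "film": ["movie", "picture"],
    "profit": ["earnings", "gain"],
    "match": ["game", "fixture"],
}

def make_synthetic_document(tokens, swap_prob=0.3, rng=None):
    """Replace some words with semantically similar words, producing a
    'synthetic document' that preserves meaning but varies surface form."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    out = []
    for tok in tokens:
        candidates = SYNONYMS.get(tok.lower())
        if candidates and rng.random() < swap_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out

def balance_by_augmentation(corpus, rng=None):
    """Oversample minority classes with synthetic documents until every
    class has as many documents as the largest class.

    `corpus` maps a class label to a list of tokenized documents."""
    rng = rng or random.Random(0)
    target = max(len(docs) for docs in corpus.values())
    balanced = {label: list(docs) for label, docs in corpus.items()}
    for label, docs in corpus.items():
        while len(balanced[label]) < target:
            source = rng.choice(docs)
            balanced[label].append(make_synthetic_document(source, rng=rng))
    return balanced

def class_weights_from_errors(misclassified, total, base=1.0):
    """Assign heavier weights to classes whose documents were commonly
    misclassified in the first training round (hypothetical scheme:
    weight = base + error rate)."""
    return {label: base + misclassified[label] / total[label] for label in total}
```

Such per-class weights could then be passed to the loss function (e.g. a weighted cross-entropy) when fine-tuning BERT a second time on the augmented corpus.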

Available for download on Monday, August 26, 2024