Recommended Citation
IEEE 18th International Conference on Semantic Computing (ICSC), 2024, pages 1-8.
The definitive version is available at https://doi.org/10.1109/ICSC59802.2024.00008.
Abstract
Document classification is a pivotal task in various domains, warranting the development of robust algorithms. Among these, the Bidirectional Encoder Representations from Transformers (BERT) model, introduced by Google, has proven to perform well when fine-tuned for the task at hand. Leveraging the transformer architecture, BERT demonstrates strong language-understanding capabilities. However, integrating BERT with complementary techniques has shown potential to further enhance classification accuracy. This paper investigates three techniques that leverage semantic understanding to improve the performance of document classification models trained with BERT. First, we balance corpora afflicted by imbalanced training data distributions. Next, we substitute words and phrases with semantically similar terms to create “synthetic documents,” thereby shifting the focus from individual words to semantic meaning. Finally, we retrain our model on a dataset composed of “synthetic documents,” giving heavier weights to classes whose documents were commonly misclassified during the initial round of training. These approaches emphasize the significance of semantic comprehension in document classification, as the meaning of words and phrases often relies on contextual cues. Our findings demonstrate the efficacy of these techniques and highlight the importance of incorporating semantic knowledge in document classification algorithms. On our labeled testing set of news articles from BBC News, BERT achieves a baseline macro F1 score of 0.902. Incorporating the three techniques improves the macro F1 score to 0.951.
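The cover page itself includes no code, so the Python sketch below is only one plausible illustration of the second and third techniques summarized in the abstract, not the authors' implementation. It builds “synthetic documents” by swapping words for WordNet synonyms (the paper's “semantically similar terms” could instead come from embeddings), and constructs a class-weighted loss for the retraining round. The function name, the replace_prob parameter, and the per-class error rates are hypothetical placeholders; the sketch assumes nltk (with the wordnet corpus downloaded) and PyTorch.

# Illustrative sketch only, not the paper's code.
import random

import torch
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")


def make_synthetic_document(text: str, replace_prob: float = 0.3) -> str:
    """Replace a fraction of words with a WordNet synonym, shifting the
    emphasis from individual surface forms toward semantic meaning."""
    out = []
    for word in text.split():
        synsets = wn.synsets(word)
        if synsets and random.random() < replace_prob:
            # Take lemmas of the first synset and drop the word itself.
            lemmas = [l.name().replace("_", " ") for l in synsets[0].lemmas()]
            candidates = [l for l in lemmas if l.lower() != word.lower()]
            out.append(random.choice(candidates) if candidates else word)
        else:
            out.append(word)
    return " ".join(out)


# Class-weighted loss for the retraining round: classes whose documents were
# often misclassified in the first round receive proportionally larger weights.
# Placeholder error rates for five classes (BBC News has five categories);
# these values are not taken from the paper.
error_rate = torch.tensor([0.05, 0.20, 0.10, 0.02, 0.08])
class_weights = 1.0 + error_rate / error_rate.mean()
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

In a full pipeline, a weighted loss like loss_fn would stand in for the default loss during the second round of BERT fine-tuning, for example by overriding compute_loss in a Hugging Face Trainer subclass.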
Disciplines
Computer Sciences
Copyright
979-8-3503-8535-9/24/$31.00 ©2024 IEEE
Number of Pages
8
URL: https://digitalcommons.calpoly.edu/csse_fac/274