Date of Award

6-2024

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Lubomir Stanchev

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

Relation extraction (RE) is a natural language processing task focused on classifying relationships between entities in a given text. RE underpins applications such as knowledge graph construction and question answering systems. Traditional approaches to RE predict the relationship between exactly two entity mentions in small text snippets. However, with the introduction of datasets such as DocRED, research in this niche has progressed to examining RE at the document level. Document-level relation extraction (DocRE) disrupts conventional approaches, as it inherently introduces the possibility of multiple mentions of each unique entity throughout the document, along with a significantly higher probability of multiple relationships between entity pairs.

There have been many effective approaches to document-level RE in recent years utilizing various architectures, such as transformers and graph neural networks. However, all of these approaches focus on the classification of a fixed number of known relationships. Given the large quantity of possible unique relationships in a corpus, it is unlikely that all interesting and valuable relationship types are labeled beforehand. Furthermore, naive clustering of the unlabeled data to discover novel classes is ineffective because that data is dominated by true negatives, i.e., entity pairs that express no relation at all. Therefore, in this work we propose a multi-step filter-and-train approach leveraging contrastive representation learning to discover novel relationships at the document level. Additionally, we propose the use of an alternative pretrained encoder in an existing DocRE solution architecture, improving F1 performance in base multi-label classification on the DocRED dataset by 0.46.
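The contrastive objective mentioned above can be illustrated with a minimal sketch. This is not the thesis's implementation; it is a generic supervised contrastive loss (the function name, the NumPy formulation, and the temperature value are all assumptions for illustration): relation embeddings sharing a label are pulled together, while all other pairs are pushed apart.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Illustrative supervised contrastive loss over relation embeddings.

    For each anchor with at least one positive (same-label) sample,
    computes the mean negative log-softmax of its positives against
    all other samples. Lower loss means same-label embeddings cluster
    more tightly than different-label ones.
    """
    # L2-normalize so the dot product is cosine similarity
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    n = len(labels)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        others = [j for j in range(n) if j != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        losses.append(-np.mean([sim[i, j] - log_denom for j in positives]))
    return float(np.mean(losses))
```

Tightly clustered same-label embeddings yield a near-zero loss, while embeddings whose labels are scrambled relative to the clusters yield a large one.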

To the best of our knowledge, this is the first exploration of novel class discovery applied to the document-level RE task. Based on our holdout evaluation method, we increase novel class instance representation in the clustering solution by 5.5 times compared to the naive approach and increase the purity of novel class clusters by nearly 4 times. We then further enable the retrieval of both novel and known classes at test time, given human labeling of the proposed clusters, achieving a macro F1 score of 0.292 for novel classes. Finally, we note only a slight macro F1 decrease on previously known classes, from 0.402 with fully supervised training to 0.391 with our novel class discovery training approach.
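The cluster purity figure reported above follows a standard definition, which can be sketched as follows (the function name and signature are illustrative, not taken from the thesis): each cluster is credited with its majority true class, and purity is the total majority count divided by the number of instances.

```python
from collections import Counter

def cluster_purity(cluster_ids, true_labels):
    """Standard cluster purity: sum over clusters of the majority
    class count, divided by the total number of instances.

    Returns 1.0 when every cluster is homogeneous, and approaches
    the majority-class frequency as clusters become mixed.
    """
    clusters = {}
    for cid, label in zip(cluster_ids, true_labels):
        clusters.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in clusters.values()
    )
    return majority_total / len(true_labels)
```

For example, two homogeneous clusters give a purity of 1.0, while a single cluster containing two classes in equal proportion gives 0.5.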

Included in

Data Science Commons
