Date of Award

6-2022

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Lubomir Stanchev

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

Though exploring one’s family lineage through genealogical family trees can be insightful for developing one’s identity, this knowledge is typically held behind closed doors by private companies or requires expensive technologies, such as DNA testing, to uncover. With the ever-growing volume of data on the World Wide Web, many unstructured text documents, both old and new, that contain rich genealogical information are being discovered, written, and processed. Access to this immense amount of data, however, entails a costly process whereby people, typically volunteers, have to read large amounts of text to find relationships between people. This delays making genealogical information open and accessible to all.

This thesis explores state-of-the-art methods for relation extraction across the genealogical and biomedical domains and bridges new and old research by proposing an updated three-tier system for parsing unstructured documents. This system makes use of recently developed and massively pretrained transformers and fine-tuning techniques to take advantage of these deep neural models’ inherent understanding of English syntax and semantics for classification.

With only a fraction of the labeled data typically needed to train large models, fine-tuning a LUKE relation classification model with minimal added features can identify genealogical relationships with macro precision, recall, and F1 scores of 0.880, 0.867, and 0.871, respectively, in data sets with scarce (∼10%) positive relations. Furthermore, with the advent of a modern coreference resolution system utilizing SpanBERT embeddings and a modern named entity parser, our end-to-end pipeline can extract and correctly classify relationships within unstructured documents with macro precision, recall, and F1 scores of 0.794, 0.616, and 0.676, respectively. This thesis also evaluates individual components of the system and discusses future improvements to be made.
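
For readers unfamiliar with macro-averaged scores, the following is a minimal sketch of how such figures are commonly computed. It assumes the usual convention (the one scikit-learn applies with average="macro"): precision, recall, and F1 are computed per relation class and then averaged with equal weight, so rare positive relations count as much as the dominant negative class. Whether the thesis computes its scores in exactly this way is an assumption. The function name macro_scores and the labels "none", "parent", and "spouse" are hypothetical and chosen only for illustration.

```python
from collections import Counter


def macro_scores(gold, pred):
    """Return (macro precision, macro recall, macro F1) for two label lists.

    Each metric is computed per class and then averaged with equal weight.
    A class with an undefined metric (zero denominator) contributes 0.0.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was the true class but was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical toy data: mostly "none", with scarce positive relations.
gold = ["none"] * 8 + ["parent", "spouse"]
pred = ["none"] * 7 + ["parent", "parent", "none"]

macro_p, macro_r, macro_f1 = macro_scores(gold, pred)
print(f"macro P={macro_p:.3f}  R={macro_r:.3f}  F1={macro_f1:.3f}")
```

Under this convention, macro F1 is the mean of the per-class F1 scores, not the harmonic mean of macro precision and macro recall, so the three reported figures need not satisfy the usual F1 formula among themselves.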
