Abstract

The paper tackles the task of extracting genealogical relationships, such as “sibling-of”, “parent-of”, “child-of”, and “spouse-of”, from unstructured, free-form text. To solve the problem, we propose a three-stage pipeline consisting of Named Entity Recognition (NER), Coreference Resolution (CR), and Relationship Classification (RC). NER uses the spaCy software to identify tokens in the text that refer to people, such as proper nouns or nicknames. CR maps multiple mentions of a person, including pronouns, to their antecedent. For example, CR could map “She”, “His sister”, and “Maria” to the antecedent “Maria Johnson”. CR allows us to transform a genealogical relationship between two tokens, such as the sibling relationship between “him” and “his sister”, into a relationship between the corresponding antecedents, for example, “Bob Johnson” and “Maria Johnson”. Our novel algorithm for coreference resolution is based on the AllenNLP software. The last step is RC, which classifies the relationship between two sets of tokens given the adjacent context. We use the LUKE transformer deep-learning model to extract the genealogical relationships. The end-to-end pipeline can extract and correctly classify genealogical relationships from our hand-labeled testing set of Wikipedia documents with macro precision, recall, and F1 scores of 0.794, 0.616, and 0.676, respectively.
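The antecedent substitution described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the cluster dictionary, the function names `build_mention_index` and `resolve_relation`, and the lookup by lowercased mention string are all assumptions made for illustration. A real coreference system would key mentions by token span rather than by surface text, since the same pronoun can refer to different people in one document.

```python
from typing import Dict, List, Tuple

Relation = Tuple[str, str, str]  # (head mention, label, tail mention)

# Hypothetical coreference output: antecedent -> mentions that refer to it.
clusters: Dict[str, List[str]] = {
    "Maria Johnson": ["She", "His sister", "Maria"],
    "Bob Johnson": ["him", "Bob"],
}


def build_mention_index(clusters: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each lowercased mention (and the antecedent itself) to its antecedent."""
    index: Dict[str, str] = {}
    for antecedent, mentions in clusters.items():
        index[antecedent.lower()] = antecedent
        for mention in mentions:
            index[mention.lower()] = antecedent
    return index


def resolve_relation(relation: Relation, index: Dict[str, str]) -> Relation:
    """Replace both mentions with their antecedents; unknown mentions are kept as-is."""
    head, label, tail = relation
    return (index.get(head.lower(), head), label, index.get(tail.lower(), tail))


index = build_mention_index(clusters)
resolved = resolve_relation(("him", "sibling-of", "his sister"), index)
print(resolved)  # ('Bob Johnson', 'sibling-of', 'Maria Johnson')
```

In this sketch, a mention that belongs to no cluster passes through unchanged, so a relation that RC finds between two already-resolved names is not altered.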

Disciplines

Computer Sciences

Number of Pages

8

URL: https://digitalcommons.calpoly.edu/csse_fac/275