Recommended Citation
IEEE 17th International Conference on Semantic Computing (ICSC), January 1, 2023, pages 167-174.
The definitive version is available at https://doi.org/10.1109/ICSC56153.2023.00035.
Abstract
This paper tackles the task of extracting genealogical relationships, such as “sibling-of”, “parent-of”, “child-of”, and “spouse-of”, from unstructured, free-form text. To solve the problem, we propose a three-stage pipeline consisting of Named Entity Recognition (NER), Coreference Resolution (CR), and Relationship Classification (RC). NER identifies tokens in the text that refer to people, such as proper nouns or nicknames, using the spaCy software. CR maps multiple tokens, such as pronouns and other references to a person, to their antecedents. For example, CR could map “She”, “His sister”, and “Maria” to the antecedent “Maria Johnson”. CR allows us to transform a genealogical relationship between two tokens, such as the sibling relationship between “him” and “his sister”, into a relationship between the corresponding antecedents, for example, “Bob Johnson” and “Maria Johnson”. Our novel algorithm for coreference resolution is based on the AllenNLP software. The last step is RC, which classifies the relationship between two sets of tokens given the adjacent context. We use the LUKE transformer deep-learning model to extract the genealogical relationships. The end-to-end pipeline can extract and correctly classify genealogical relationships from our hand-labeled testing set of Wikipedia documents with macro precision, recall, and F1 scores of 0.794, 0.616, and 0.676, respectively.
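The way coreference resolution lifts a mention-level relationship to an entity-level one can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `COREF_CLUSTERS` table, the `Relation` tuple layout, and the `resolve_relations` function are hypothetical names introduced here, and the table simply hard-codes the mappings from the abstract's example rather than calling spaCy, AllenNLP, or LUKE. A real pipeline would most likely key mentions by their position in the text rather than by their surface string.

```python
from typing import Dict, List, Tuple

# Hypothetical coreference output: mention text -> antecedent.
# Hard-coded from the abstract's example; a real system would
# produce this mapping from a coreference model.
COREF_CLUSTERS: Dict[str, str] = {
    "She": "Maria Johnson",
    "his sister": "Maria Johnson",
    "Maria": "Maria Johnson",
    "him": "Bob Johnson",
}

# (head mention, relationship label, tail mention)
Relation = Tuple[str, str, str]


def resolve_relations(
    relations: List[Relation], clusters: Dict[str, str]
) -> List[Relation]:
    """Replace each mention with its antecedent, if one is known."""
    resolved: List[Relation] = []
    for head, label, tail in relations:
        head_entity = clusters.get(head, head)
        tail_entity = clusters.get(tail, tail)
        # Skip degenerate cases where both mentions resolve to one person.
        if head_entity != tail_entity:
            resolved.append((head_entity, label, tail_entity))
    return resolved


mention_level = [("him", "sibling-of", "his sister")]
print(resolve_relations(mention_level, COREF_CLUSTERS))
```

Mentions with no entry in the table are passed through unchanged, so the sketch degrades to the mention-level relationship when no antecedent is available.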
Disciplines
Computer Sciences
Copyright
978-1-6654-8263-9/23/$31.00 ©2023 IEEE
Number of Pages
8
URL: https://digitalcommons.calpoly.edu/csse_fac/275