Date of Award

5-2026

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Foaad Khosmood

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

This thesis explores methods to extract structured data from historical documents, making them more accessible for analysis and visualization. The goal is to provide historians and Digital Humanities scholars with tools that save time and reduce the manual effort required for this kind of research. We work with census records and historical texts from the Early California Population Project and Spanish Colonial census, covering California's 18th-19th century transition from Spanish and Mexican control to U.S. statehood. The system has three components: person matching across records, family tree generation, and land grant extraction. For person matching, we use a modified Levenshtein distance algorithm to link individuals across census and mission records, accounting for Spanish colonial name variations. Matched records feed into family tree generation, connecting ancestors to descendants across generations. The main focus of this thesis is land grant extraction from California Ranchos, a book documenting private land grants across California counties. We develop a rule-based NLP pipeline using spaCy to extract grant names, transaction histories, and geographic coordinates from semi-structured historical text. We then compare this approach against Large Language Models (GPT-4o and Grok-3) to evaluate whether modern LLMs can outperform traditional NLP methods on domain-specific historical documents. To ensure fair comparison, we built a human-annotated golden set of 100 land grants and developed a unified scoring algorithm that weights fields by importance: identification fields (3 points), transaction details (2 points), and supplementary information (1 point). To evaluate statistical significance, we ran each LLM seven times and applied bootstrap resampling (n=100) to the deterministic rule-based methods. Results show that LLM extraction (Grok-3: 84.7\%, GPT-4o: 83.7\%) outperforms baseline rule-based extraction (80.3\%) by 3-4 percentage points. However, targeted post-processing enhancements addressing OCR artifacts, coordinate formatting, and name normalization improved the rule-based approach to 83.0, narrowing the gap to just 1.7 percentage points. The largest remaining differences appear in supplementary fields requiring contextual interpretation. Our hope is that this work demonstrates the viability of automated extraction for historical records and provides useful methodology for the Digital Humanities community.

Share

COinS