Master's Theses

Extracting California's Land Grant History: A Comparative Study of Rule-Based and LLM Methods for Historical Document Processing

Available at: https://digitalcommons.calpoly.edu/theses/3306

Date of Award

5-2026

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Foaad Khosmood

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

This thesis explores methods to extract structured data from historical documents, making them more accessible for analysis and visualization. The goal is to provide historians and Digital Humanities scholars with tools that save time and reduce the manual effort required for this kind of research. We work with census records and historical texts from the Early California Population Project and Spanish Colonial census, covering California's 18th-19th century transition from Spanish and Mexican control to U.S. statehood. The system has three components: person matching across records, family tree generation, and land grant extraction. For person matching, we use a modified Levenshtein distance algorithm to link individuals across census and mission records, accounting for Spanish colonial name variations. Matched records feed into family tree generation, connecting ancestors to descendants across generations. The main focus of this thesis is land grant extraction from California Ranchos, a book documenting private land grants across California counties. We develop a rule-based NLP pipeline using spaCy to extract grant names, transaction histories, and geographic coordinates from semi-structured historical text. We then compare this approach against Large Language Models (GPT-4o and Grok-3) to evaluate whether modern LLMs can outperform traditional NLP methods on domain-specific historical documents. To ensure a fair comparison, we built a human-annotated golden set of 100 land grants and developed a unified scoring algorithm that weights fields by importance: identification fields (3 points), transaction details (2 points), and supplementary information (1 point). To evaluate statistical significance, we ran each LLM seven times and applied bootstrap resampling (n=100) to the deterministic rule-based methods. Results show that LLM extraction (Grok-3: 84.7%, GPT-4o: 83.7%) outperforms baseline rule-based extraction (80.3%) by 3-4 percentage points. However, targeted post-processing enhancements addressing OCR artifacts, coordinate formatting, and name normalization improved the rule-based approach to 83.0%, narrowing the gap to the best-performing LLM (Grok-3) to just 1.7 percentage points. The largest remaining differences appear in supplementary fields requiring contextual interpretation. Our hope is that this work demonstrates the viability of automated extraction for historical records and provides useful methodology for the Digital Humanities community.
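
The weighted scoring and bootstrap resampling described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the field names, the exact-match comparison, and the reading of n=100 as the number of resamples are all assumptions made for illustration. Only the three weight tiers (3, 2, and 1 points) come from the abstract.

```python
import random

# Hypothetical field names; the abstract specifies only the three weight tiers.
FIELD_WEIGHTS = {
    "grant_name": 3,  # identification field (3 points)
    "grantee": 2,     # transaction detail (2 points)
    "grant_date": 2,  # transaction detail (2 points)
    "notes": 1,       # supplementary information (1 point)
}


def score_record(predicted, gold, weights=FIELD_WEIGHTS):
    """Return the fraction of weighted points earned by one extracted record.

    Assumption: a field earns its points only on an exact match with the
    golden-set value; the thesis may use a more lenient comparison.
    """
    earned = sum(w for field, w in weights.items()
                 if predicted.get(field) == gold.get(field))
    return earned / sum(weights.values())


def bootstrap_mean_scores(scores, n_resamples=100, seed=0):
    """Resample per-record scores with replacement; return each resample's mean.

    Assumption: n=100 in the abstract is read as the number of resamples.
    """
    rng = random.Random(seed)
    k = len(scores)
    return [sum(rng.choices(scores, k=k)) / k for _ in range(n_resamples)]


# Usage with made-up records: every field matches except the 1-point one.
gold = {"grant_name": "Rancho Example", "grantee": "A. Person",
        "grant_date": "1840", "notes": "confirmed"}
pred = dict(gold, notes="unclear")
record_score = score_record(pred, gold)  # 7 of 8 weighted points
resampled_means = bootstrap_mean_scores([record_score, 1.0, 0.5])
```

The spread of the resampled means gives a deterministic extractor a score distribution that can be compared against the variation across repeated LLM runs.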
