Available at: https://digitalcommons.calpoly.edu/theses/3232
Date of Award
3-2026
Degree Name
MS in Computer Science
Department/Program
Computer Science
College
College of Engineering
Advisor
Paul Anderson
Advisor Department
Computer Science
Advisor College
College of Engineering
Abstract
Carbonic anhydrases (CAs) catalyze the reversible hydration of CO2 and have evolved independently at least eight times, resulting in structurally distinct enzyme families (α, β, γ, δ, ζ, η, θ, ι). Traditional sequence alignment methods struggle to classify these convergently evolved proteins because sequence similarity between them does not reliably indicate functional or evolutionary relationships. Many CA sequences in public databases are annotated generically without family assignments, and prior computational approaches have focused predominantly on the three well-characterized families (α, β, γ), leaving the five more recently discovered classes without robust classification tools. Family-level assignment is often a prerequisite for functional inference, inhibitor design, and phylogenetic analysis, because CA classes differ in active site geometry, oligomeric structure, and catalytic mechanism, all of which are difficult to predict from sequence alone.
In this thesis, we constructed family-specific profile Hidden Markov Models (HMMs) for all eight CA families using curated seed alignments of 575 sequences. The resulting HMM profiles achieved 100% classification accuracy on held-out test sequences (n=119) and successfully classified 148,727 UniProt CA sequences, with 94.8% assigned to specific families. Bit score and entropy analyses revealed distinct intra-family variance, confirming that the models capture meaningful sequence differences that distinguish the families. These profiles provide the first comprehensive HMM-based classification framework spanning all recognized CA families, enabling systematic annotation of carbonic anhydrase datasets.
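The abstract does not state how per-family scores are turned into a family assignment. As a minimal sketch of one common approach, assuming each sequence has already been scored against all eight profiles, the sequence can be assigned to the family whose model gives the highest bit score and left unassigned when no model scores above a cutoff. The function name `assign_family`, the family labels, and the `min_score` cutoff of 25.0 are hypothetical illustrations, not values taken from the thesis.

```python
from typing import Dict, Optional, Tuple

# Hypothetical labels for the eight CA families named in the abstract.
FAMILIES = ("alpha", "beta", "gamma", "delta", "zeta", "eta", "theta", "iota")


def assign_family(
    bit_scores: Dict[str, float],
    min_score: float = 25.0,  # assumed cutoff, not a value from the thesis
) -> Tuple[Optional[str], float]:
    """Pick the best-scoring family for one sequence.

    bit_scores maps a family label to the bit score that family's profile
    HMM gave the sequence. Returns (family, margin), where margin is the gap
    between the best and second-best scores. Returns (None, 0.0) when there
    are no scores or the best score falls below min_score.
    """
    if not bit_scores:
        return None, 0.0

    # Rank families from highest to lowest bit score.
    ranked = sorted(bit_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_family, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    if best_score < min_score:
        return None, 0.0  # generic CA, no confident family assignment

    return best_family, best_score - runner_up_score


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = {"alpha": 310.2, "beta": 12.5, "gamma": 8.1}
    print(assign_family(example))
```

The margin between the top two scores gives a rough indication of how decisively one family model wins over the others; a small margin would flag a sequence for manual review.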