Date of Award

3-2026

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Paul Anderson

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

Carbonic anhydrases (CAs) catalyze the reversible hydration of CO2 and have evolved independently at least eight times, resulting in structurally distinct enzyme families (α, β, γ, δ, ζ, η, θ, ι). Traditional sequence alignment methods struggle to classify these convergently evolved proteins because sequence similarity between them does not reliably indicate functional or evolutionary relationships. Many CA sequences in public databases are annotated generically without family assignments, and prior computational approaches have focused predominantly on the three well-characterized families (α, β, γ), leaving the five more recently discovered classes without robust classification tools. Family-level assignment is often a prerequisite for functional inference, inhibitor design, and phylogenetic analysis, because CA classes differ in active-site geometry, oligomeric structure, and catalytic mechanism, all of which are difficult to predict from sequence alone.

In this thesis, we constructed family-specific profile Hidden Markov Models (HMMs) for all eight CA families using curated seed alignments of 575 sequences. The resulting HMM profiles achieved 100% classification accuracy on held-out test sequences (n=119). Applied to 148,727 UniProt CA sequences, they assigned 94.8% to specific families. Bit-score and entropy analyses revealed distinct patterns of intra-family variance, confirming that the models capture meaningful sequence differences that distinguish one family from another. These profiles provide the first comprehensive HMM-based classification framework spanning all recognized CA families, enabling systematic annotation of carbonic anhydrase datasets.
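
As a rough illustration of how per-family bit scores might be turned into a family call, the following Python sketch assigns a sequence to the family whose profile scores highest and leaves it unassigned when the evidence is weak. It is not the procedure used in the thesis: the function name `assign_family`, the family labels, and the `min_score` and `min_margin` thresholds are all hypothetical, and the bit scores are assumed to have been computed beforehand by searching the sequence against each family profile.

```python
from typing import Dict, Optional, Tuple

# Hypothetical labels for the eight CA families named in the abstract.
FAMILIES = ("alpha", "beta", "gamma", "delta", "zeta", "eta", "theta", "iota")


def assign_family(
    scores: Dict[str, float],
    min_score: float = 25.0,   # assumed cutoff, not taken from the thesis
    min_margin: float = 10.0,  # assumed lead over the runner-up, also illustrative
) -> Tuple[Optional[str], float]:
    """Pick the best-scoring family from precomputed per-family bit scores.

    Returns (family, margin), where margin is the gap between the best and
    second-best bit score. The family is None when the best score is below
    min_score or the margin is below min_margin.
    """
    if not scores:
        return None, 0.0

    # Rank families from highest to lowest bit score.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_family, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    margin = best_score - runner_up

    if best_score < min_score or margin < min_margin:
        return None, margin
    return best_family, margin


if __name__ == "__main__":
    # Made-up scores for a single sequence, for demonstration only.
    example = {"alpha": 312.4, "beta": 18.2, "gamma": 9.7}
    family, margin = assign_family(example)
    print(family, round(margin, 1))
```

Requiring both a minimum score and a minimum margin is one simple way to leave ambiguous sequences unassigned rather than forcing a call. Suitable thresholds would have to be calibrated against held-out sequences.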
