Date of Award

6-2026

Degree Name

MS in Computer Science

Department/Program

Computer Science

College

College of Engineering

Advisor

Paul Anderson

Advisor Department

Computer Science

Advisor College

College of Engineering

Abstract

Large protein databases now make it possible to study short peptides across natural protein sequence space at unprecedented scale, but exhaustively counting k-mers across billions of protein sequences remains computationally difficult. This thesis develops an exact amino-acid k-mer counting method based on direct addressing, in which fixed-length amino-acid strings are encoded as base-20 integers and updated with a sliding-window recurrence. By avoiding key storage and collision resolution, this approach removes overhead inherent to hash-map-based methods when the k-mer space is sufficiently dense. A memory analysis shows when direct addressing is preferable to open-addressing hash tables, and expected-saturation calculations motivate its use for GigaRef-scale datasets up to k=9.
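The direct-addressing idea described above can be illustrated with a minimal sketch. It is not the thesis implementation: the alphabet ordering, the choice to reset the window at non-standard residues, the 32-bit counter type, and the function names (`rolling_kmer_codes`, `count_kmers`, `encode`) are all assumptions made for illustration.

```python
from array import array

# Assumed ordering of the 20 standard amino acids; the thesis may use another.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(kmer):
    """Encode a k-mer of standard residues as a base-20 integer (hypothetical helper)."""
    code = 0
    for ch in kmer:
        code = code * 20 + INDEX[ch]
    return code


def rolling_kmer_codes(seq, k):
    """Yield the base-20 code of every k-mer in seq using a sliding-window update.

    Assumption: any character outside the 20-letter alphabet resets the window,
    so k-mers spanning such a character are skipped.
    """
    high = 20 ** (k - 1)  # place value of the leading digit
    code = 0
    valid = 0  # residues currently in the window (at most k)
    for ch in seq:
        digit = INDEX.get(ch)
        if digit is None:
            code = 0
            valid = 0
            continue
        if valid == k:
            code %= high  # drop the leading residue
        else:
            valid += 1
        code = code * 20 + digit  # shift left by one residue and append
        if valid == k:
            yield code


def count_kmers(seqs, k):
    """Count k-mers in a dense table indexed directly by the k-mer code.

    No keys are stored and no collisions can occur; the table has 20**k slots.
    Assumption: unsigned counters of array type "I", which would overflow for
    very frequent k-mers at full database scale.
    """
    table = array("I", [0]) * (20 ** k)
    for seq in seqs:
        for code in rolling_kmer_codes(seq, k):
            table[code] += 1
    return table


if __name__ == "__main__":
    table = count_kmers(["ACDAC", "ACXAC"], 2)
    print(table[encode("AC")])
```

In this toy input the 2-mer `AC` occurs twice in each sequence, and the `X` in the second sequence resets the window rather than contributing a k-mer. The same table layout at k = 8 would need 20^8 = 25.6 billion slots, which is why the trade-off against hash tables depends on how densely the k-mer space is occupied.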

The counting pipeline is applied to more than 557 billion overlapping 8-mers from GigaRef, producing a complete count table over all 25.6 billion possible amino-acid 8-mers. This table enables analysis of missing, rare, and frequent peptides at full sequence-space scale. We find that 6.53 billion possible 8-mers are absent from GigaRef, substantially more than expected under codon-degeneracy and observed amino-acid/dimer-frequency null models. To characterize these deviations, the thesis uses null-model analysis, logistic regression, negative-binomial neural networks, permutation feature importance, and composition-preserving permutation tests. The results show that codon degeneracy is a strong null model, while biological cost-related features explain additional variation in 8-mer occurrence and abundance. A supplementary feature-importance analysis shows that negative-binomial models with similar predictive performance can produce different feature rankings, suggesting that individual biological feature rankings should be interpreted cautiously. Finally, comparison with ProteinMPNN-generated sequences shows how exact natural k-mer counts can be used to evaluate biases in protein sequence design models. Together, these results provide evidence for structured constraints in protein sequence space and show how large-scale k-mer analysis can be used to diagnose biases in generative protein models.
