Available at: https://digitalcommons.calpoly.edu/theses/3378
Date of Award: 6-2026
Degree Name: MS in Computer Science
Department/Program: Computer Science
College: College of Engineering
Advisor: Paul Anderson
Advisor Department: Computer Science
Advisor College: College of Engineering
Abstract
Large protein databases now make it possible to study short peptides across natural protein sequence space at unprecedented scale, but exhaustively counting k-mers across billions of protein sequences remains computationally difficult. This thesis develops an exact amino-acid k-mer counting method based on direct addressing, in which fixed-length amino-acid strings are encoded as base-20 integers and updated with a sliding-window recurrence. By avoiding key storage and collision resolution, this approach removes overhead inherent to hash-map-based methods when the k-mer space is sufficiently dense. A memory analysis shows when direct addressing is preferable to open-addressing hash tables, and expected-saturation calculations motivate its use for GigaRef-scale datasets up to k=9.
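The encoding and sliding-window update described above can be illustrated with a minimal sketch. It is not the thesis implementation: the alphabet order, the function names `encode`, `rolling_codes`, and `count_kmers`, and the choice to restart the window at non-standard residues are all assumptions made for illustration.

```python
from array import array

# Assumed ordering of the 20 standard amino acids; any fixed order works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode(kmer):
    """Encode a k-mer of standard residues as a base-20 integer."""
    code = 0
    for aa in kmer:
        code = code * 20 + INDEX[aa]
    return code


def rolling_codes(seq, k):
    """Yield the base-20 code of every k-mer window in seq.

    Each step drops the leading digit and appends the new residue, so a
    window costs O(1) instead of O(k). Windows that contain a character
    outside the assumed alphabet are skipped (an illustrative choice).
    """
    high = 20 ** (k - 1)   # place value of the leading base-20 digit
    code = 0
    valid = 0              # residues currently held in the window
    for ch in seq:
        digit = INDEX.get(ch)
        if digit is None:
            code, valid = 0, 0   # restart after a non-standard residue
            continue
        if valid == k:
            code %= high         # drop the leading digit
        else:
            valid += 1
        code = code * 20 + digit
        if valid == k:
            yield code


def count_kmers(seqs, k):
    """Direct addressing: the code is the table index, so no keys are stored."""
    table = array("I", [0]) * (20 ** k)
    for seq in seqs:
        for code in rolling_codes(seq, k):
            table[code] += 1
    return table


if __name__ == "__main__":
    counts = count_kmers(["ACAC"], k=2)
    print(counts[encode("AC")], counts[encode("CA")])
```

Because every possible k-mer owns a slot, the table size depends only on k and the counter width, not on how many distinct k-mers occur. With a hypothetical 4-byte counter, the 20^8 slots for k=8 would occupy about 102.4 GB; the counter width used in the thesis is not stated here.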
The counting pipeline is applied to more than 557 billion overlapping 8-mers from GigaRef, producing a complete count table over all 25.6 billion possible amino-acid 8-mers. This table enables analysis of missing, rare, and frequent peptides at full sequence-space scale. We find that 6.53 billion possible 8-mers are absent from GigaRef, substantially more than expected under codon-degeneracy and observed amino-acid/dimer-frequency null models. To characterize these deviations, the thesis uses null-model analysis, logistic regression, negative-binomial neural networks, permutation feature importance, and composition-preserving permutation tests. The results show that codon degeneracy is a strong null model, while biological cost-related features explain additional variation in 8-mer occurrence and abundance. A supplementary feature-importance analysis shows that negative-binomial models with similar predictive performance can produce different feature rankings, suggesting that individual biological feature rankings should be interpreted cautiously. Finally, comparison with ProteinMPNN-generated sequences shows how exact natural k-mer counts can be used to evaluate biases in protein sequence design models. Together, these results provide evidence for structured constraints in protein sequence space and show how large-scale k-mer analysis can be used to diagnose biases in generative protein models.
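For a sense of scale, the sketch below computes the size of the 8-mer space and the number of 8-mers that would be expected to be absent if every window were drawn uniformly and independently. This uniform model is a deliberate simplification introduced here for illustration; it is not one of the codon-degeneracy or frequency-based null models used in the thesis, and the function name `expected_absent_uniform` is hypothetical.

```python
import math


def expected_absent_uniform(num_windows, k, alphabet=20):
    """Expected number of unseen k-mers under a uniform, independent model.

    Uses the Poisson approximation N * exp(-n / N), where N is the size of
    the k-mer space and n is the number of observed windows.
    """
    space = alphabet ** k
    return space * math.exp(-num_windows / space)


if __name__ == "__main__":
    space = 20 ** 8                 # 25,600,000,000 possible 8-mers
    observed_absent = 6.53e9        # figure reported in the abstract
    print(f"8-mer space: {space:,}")
    print(f"observed absent fraction: {observed_absent / space:.3f}")
    # 557e9 is the approximate window count reported in the abstract.
    print(f"expected absent (uniform): {expected_absent_uniform(557e9, 8):.1f}")
```

Under this simplified model almost every 8-mer would be observed at least once, which is why composition-aware null models are the more informative baseline for interpreting the billions of absent 8-mers.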
Included in: Bioinformatics Commons, Computational Biology Commons, Computational Engineering Commons