Available at: https://digitalcommons.calpoly.edu/theses/3195
Date of Award
12-2025
Degree Name
MS in Computer Science
Department/Program
Computer Science
College
College of Engineering
Advisor
Paul Anderson
Advisor Department
Computer Science
Advisor College
College of Engineering
Abstract
The reproducibility crisis in biomarker discovery stems from traditional approaches that often treat molecular features as independent variables and ignore the networked nature of biological systems. We present an interpretable-by-design framework that models individual tumor network states with graph attention networks (GATs) to discover robust breast cancer biomarkers. By constraining the search space through biologically informed gene selection and multi-relational graphs that integrate protein-protein interactions, pathways, and co-expression networks, we guide the model toward genuine biological relationships rather than spurious correlations. Our ensemble GAT approach achieved 77.4% balanced accuracy for molecular subtype classification. Systematic analysis of attention weights revealed an unexpected finding: 98 of 99 high-confidence biomarkers were terminal nodes rather than network hubs, and they consistently connected to established breast cancer drivers including TP53, EGFR, ESR1, and CCND1. We distilled these network-based discoveries into a 70-gene diagnostic panel using interpretable linear models, achieving 75.3% accuracy with expression data alone. This biologically constrained, interpretable-by-design approach demonstrates how network-guided machine learning can yield both mechanistic understanding and reproducible biomarkers.
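The abstract reports balanced accuracy for subtype classification and plain accuracy for the 70-gene panel. As a reading aid, the sketch below shows how the two metrics can differ on imbalanced classes, using the standard definition of balanced accuracy as the mean of per-class recall. The function name, the class labels "A" and "B", and the toy predictions are illustrative assumptions and are not taken from the thesis.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly, ignoring class sizes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, made-up labels: class "A" is common, class "B" is rare.
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one of the two "B" samples is missed

plain = accuracy(y_true, y_pred)
balanced = balanced_accuracy(y_true, y_pred)
print(f"accuracy={plain:.2f} balanced_accuracy={balanced:.2f}")
```

In this toy case the rare class is recovered only half the time, so balanced accuracy falls well below plain accuracy. For that reason the two percentages quoted in the abstract are not directly comparable without knowing the class distribution.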