DOI: https://doi.org/10.15368/theses.2022.117
Available at: https://digitalcommons.calpoly.edu/theses/2628
Date of Award
12-2022
Degree Name
MS in Computer Science
Department/Program
Computer Science
College
College of Engineering
Advisor
Paul Anderson
Advisor Department
Computer Science
Advisor College
College of Engineering
Abstract
In the realm of biomedical technology, both accuracy and consistency are crucial to the development and deployment of these tools. While accuracy is easy to measure, consistency metrics are not so simple to measure, especially in the scope of biomedicine where prediction consistency can be difficult to achieve. Typically, biomedical datasets contain a significantly larger amount of features compared to the amount of samples, which goes against ordinary data mining practices. As a result, predictive models may fail to find valid pathways for prediction during training on such datasets. This concept is known as underspecification.
Underspecification has been more accepted as a concept in recent years, with a handful of recent works exploring underspecification in different applications and a handful of past works experiencing underspecification prior to its declaration. However, underspecification is still under-addressed, to the point where some academics might even claim that it is not a significant problem.
With this in mind, this thesis aims to identify and minimize underspecification of deep learning cancer subtype predictors. To address these goals, this work details the development of Predicting Underspecification Monitoring Pipeline (PUMP), a software tool to provide methodology for data analysis, stress testing, and model evaluation. In this context, the hope is that PUMP can be applied to deep learning training such that any user can ensure that their models are able to generalize to new data as best as possible.
Included in
Biomedical Commons, Biomedical Informatics Commons, Computer Engineering Commons, Medical Genetics Commons