Available at: https://digitalcommons.calpoly.edu/theses/2863
Date of Award
6-2024
Degree Name
MS in Statistics
Department/Program
Statistics
College
College of Science and Mathematics
Advisor
Trevor Ruiz
Advisor Department
Statistics
Advisor College
College of Science and Mathematics
Abstract
Understanding marine mammal populations and how they are affected by human activity and ocean conditions is vital, especially for tracking population declines and monitoring endangered species. However, tracking marine mammal populations and their distribution is challenging because direct observation is difficult and costly. Environmental DNA (eDNA) from the surrounding plankton has the potential to provide an indirect means of monitoring cetacean abundance based on ecological associations. This project applies statistical methods to assess the relationship between visually observed abundances of common baleen whale species and amplicon sequence variants (ASVs) from plankton eDNA samples collected by the NOAA-CalCOFI Ocean Genomics (NCOG) project. Modeling the relationship between eDNA and marine mammal sightings may greatly improve the ability to predict the abundance of whales in the ocean.
There are several key challenges associated with the analysis of the NCOG data. Plankton eDNA samples are an example of compositional data, where the proportions of each ASV must sum to one; this constraint complicates statistical analysis and interpretation. High dimensionality (the number of parameters exceeds the number of observations) and sparsity (many observed zeros) of the genetic sequencing data also pose challenges in estimating parameters. Finally, the model associations should be adjusted for related factors, including seasonality and oceanographic factors, the latter of which goes beyond this project's scope.
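As a rough illustration of these three properties, the sketch below builds a small, entirely hypothetical count table (not NCOG data), converts each sample to proportions, and reports the share of zeros. The function name `closure` and all values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical read counts: 3 samples (rows) x 5 ASVs (columns).
# These numbers are made up for illustration only.
counts = np.array([
    [10, 0, 5, 0, 85],
    [0, 0, 20, 30, 50],
    [4, 1, 0, 0, 95],
], dtype=float)


def closure(x):
    """Rescale each row to proportions that sum to one."""
    return x / x.sum(axis=1, keepdims=True)


props = closure(counts)

# Compositional constraint: every sample's proportions sum to one.
row_sums = props.sum(axis=1)

# Sparsity: fraction of entries that are exactly zero.
zero_fraction = float(np.mean(counts == 0))

# High dimensionality: more ASVs (candidate predictors) than samples.
n_samples, n_asvs = counts.shape
```

Because every row is forced to sum to one, an increase in one ASV's proportion necessarily lowers the others, which is why methods that treat the columns as unconstrained predictors can mislead.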
This thesis develops and fits models to estimate cetacean abundance from plankton eDNA by leveraging methods of compositional data analysis and high-dimensional regression. This project applies log-ratio data transformations and corresponding log-contrast models to address the compositional aspect of eDNA reads. Regression methods involving high-dimensional data typically rely on dimensionality reduction or regularization. This project implements both reduction and regularization through sparse partial least squares (sPLS) regression. In addition to the data modeling objective of using plankton eDNA to predict baleen whale abundances, this project also identifies ecological correlations between whale abundance and plankton eDNA.
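One common log-ratio transformation is the centered log-ratio (CLR); the abstract does not say which transformation the thesis uses, so the sketch below is only an example of the general idea. The function name `clr`, the pseudocount of 0.5 used to handle zeros, and the toy counts are all assumptions, and the sPLS regression step is not shown.

```python
import numpy as np


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples-by-ASVs count matrix.

    The pseudocount (an assumed value) is added to every entry so that
    the logarithm is defined where zero reads were observed.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtract each sample's mean log value (the log geometric mean).
    return log_x - log_x.mean(axis=1, keepdims=True)


# Hypothetical counts: 2 samples x 4 ASVs.
counts = np.array([[10, 0, 5, 85],
                   [0, 20, 30, 50]])
z = clr(counts)
```

Each row of `z` sums to zero, so a linear model on CLR-transformed predictors whose coefficients sum to zero is a log-contrast model. Without a pseudocount the transform is unchanged when a sample's counts are rescaled, so sequencing depth does not affect it.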