Available at: https://digitalcommons.calpoly.edu/theses/3141
Date of Award
6-2025
Degree Name
MS in Statistics
Department/Program
Statistics
College
College of Science and Mathematics
Advisor
Kelly Bodwin
Advisor Department
Statistics
Advisor College
College of Science and Mathematics
Abstract
This thesis introduces a new implementation of the BIRCH clustering algorithm within the tidyclust framework in R. Traditional hierarchical clustering methods scale poorly to large datasets because of their computational complexity. BIRCH offers a scalable alternative by summarizing data into microclusters using a CF-tree. This work integrates the phases of the BIRCH algorithm into tidyclust, enabling a streamlined workflow for model specification, evaluation, and prediction. The implementation brings scalable hierarchical clustering to R users through the tidyclust interface, enhancing their ability to analyze large datasets.