Available at: http://digitalcommons.calpoly.edu/theses/1511
MS in Computer Science
As part of an ongoing multidisciplinary effort at California Polytechnic State University, biologists and computer scientists have developed a new Library-Dependent Microbial Source Tracking method for identifying the host animals causing fecal contamination in local water sources. The Cal Poly Library of Pyroprints (CPLOP) is a database that stores representations of E. coli from fecal samples of known hosts, obtained with Pyroprinting, a novel method developed by the biologists. The research group considers E. coli samples whose Pyroprints match above a certain threshold to be part of the same bacterial strain. If an environmental sample from an unknown host animal matches one of the strains in CPLOP, then the host of the unknown sample is likely the same species as one of the hosts in which the strain was previously found. The computer science technique for finding groups of related data (i.e., strains) in a data set is called clustering. In this thesis, we evaluate the use of density-based clustering for identifying strains in CPLOP. Density-based clustering finds clusters of points that have a minimum number of other points within a given radius. We contribute a clustering algorithm, based on the original DBSCAN algorithm, that removes points from the search space after they have been seen once. We also present a new method for comparing Pyroprints that is algebraically related to the current method. The new method has mathematical properties that make it possible to organize Pyroprints in a spatial index we designed specifically for them, which the DBSCAN algorithm can use to speed up clustering.
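To make the density-based idea concrete, the following is a minimal sketch of textbook DBSCAN on 2-D points. It is not the modified algorithm contributed in the thesis, and it does not use Pyroprints or the spatial index: the Euclidean distance, the brute-force neighbor search, and the names `region_query`, `dbscan`, `eps`, and `min_pts` are all assumptions chosen for illustration.

```python
from math import dist

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Return indices of the OTHER points within eps of points[i].

    Brute-force scan; a spatial index would replace this step.
    """
    return [j for j, q in enumerate(points)
            if j != i and dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN sketch: one cluster label per point, NOISE for outliers.

    A point is a core point if at least min_pts other points lie within eps.
    """
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned; do not expand again
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Hypothetical toy data: two tight groups and one isolated point.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(toy_points, eps=1.5, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

In this sketch every neighbor search scans all points, so clustering takes quadratic time. That cost is what a spatial index is meant to reduce.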