Available at: http://digitalcommons.calpoly.edu/theses/1040
MS in Computer Science
Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned. Bioinformatics involves the analysis of various types of data from multiple sources to create a model of a physical process found in the real world. To support this modeling, data is accumulated in a variety of biological databases. So far this has worked relatively well. Unfortunately, the amount of biological data being generated is increasing exponentially, and there are several problems with how it is currently stored. The most troubling issue comes from a survey done by Bry and Kröger in 2003, which found that out of 111 biological databases, 40-44 were collections of flat files and 41-42 were relational databases. Both of these types of systems have serious scaling limits. To store all of the biological data in the future, distributed systems are needed.
Distributed systems fall into three types: Consistent Available (CA), Consistent Partition-Tolerant (CP), and Available Partition-Tolerant (AP). We argue that AP systems best meet biologists' requirements for several reasons. First, the workloads commonly run on biological databases are predominantly reads, so strict consistency guarantees are not needed. Second, the workloads are non-transactional, so there is no need for the ACID guarantees commonly found in relational databases. Lastly, availability is important because research should not be hampered by long database queries.
As a proof-of-concept application to show that an AP system works well, we needed a bioinformatics problem to tackle. Dr. Anya Goodman, a professor in the department of Chemistry and Biochemistry at Cal Poly, San Luis Obispo, offers a bioinformatics course that covers several aspects of gene annotation and genomic research. Currently, no system exists in which users can input and save gene annotations and get immediate feedback regarding their performance. Thus, the Community Genome Annotation Training (CGAT) database was born. In our evaluations, we compared MySQL, Couchbase, and MongoDB implementations of the CGAT back end and found that MongoDB (an AP system) performed the best for the workloads expected on CGAT.