DOI: https://doi.org/10.15368/theses.2019.128
Available at: https://digitalcommons.calpoly.edu/theses/2529
Date of Award
6-2019
Degree Name
MS in Computer Science
College
College of Agriculture, Food, and Environmental Sciences
Advisor
Maria Pantoja
Advisor Department
<--Please Select Department-->
Advisor College
College of Agriculture, Food, and Environmental Sciences
Abstract
Distributed Systems (DS) where multiple computers share a workload across a network, are used everywhere, from data intensive computations to storage and machine learning. DS provide a relatively cheap and efficient solution that allows stability with improved performance for computational intensive applications. In a DS faults and failures are the norm not the exception. At any moment data corruption can occur especially since a DS usually consists of hundred to thousands of units of commodity hardware. The large number and quality of components guarantees, by probability, that at any given time some of the components will not be working and some of them will not recover from failure. DS can experience problems caused by application bugs, operating systems bugs, failures with disks, memory, connectors, networking, power supply, and other components; therefore, constant monitoring and failure detection are fundamental. Automatic recovery must be integral to the system. One of the most commonly used programming languages for DS is Message Passing Interface (MPI). Unfortunately MPI does not support fault detection or recovery. In this thesis, we build a recovery mechanism based on replicas that works on top of the asynchronous fault detection implemented in previous work. Results shows that our recovery implementation is successful and the overhead in execution time is minimal.