Available at: https://digitalcommons.calpoly.edu/theses/1561
Date of Award
MS in Computer Science
As the scale of supercomputers grows, it is becoming increasingly important for software to efficiently withstand hardware and software faults. Process replication is one resilience technique, but typical implementations require replicas to stay closely synchronized with each other. We propose algorithms to lazily detect faults in replicated MPI applications, allowing for more flexibility in replica scheduling and potential power savings. Evaluation shows that, when all processes are operated at full power, this approach allows applications to complete substantially faster as compared to using a synchronized model, and often as fast as in non-replicated execution.