Component Based Design of Fault-Tolerance


Sandeep Kulkarni

Abstract

In this dissertation, we defend the thesis that a fault-tolerant program is a composition of a fault-intolerant program and a set of {\em fault-tolerance components}. To validate this thesis, we identify a basis set of fault-tolerance components, namely {\em detectors} and {\em correctors}. We show that depending upon the desired level of fault-tolerance, detectors, correctors, or both are necessary and sufficient for designing a rich class of fault-tolerant programs. Moreover, the fault-tolerant programs designed using existing methods such as replication and Schneider's state machine approach can be alternatively designed in terms of detectors and correctors.

Using these fault-tolerance components, we present a method for designing multitolerant programs, i.e., programs that tolerate multiple types of faults while providing potentially different levels of tolerance to each type of fault. Our method accommodates all types of faults, including process faults, communication faults, hardware and software faults, network failure and security intrusions, and it can be used in several application domains such as distributed systems, computer networks and parallel systems.

Using several examples, we illustrate how our fault-tolerance components can be used in the design of several fault-tolerant programs. Among these examples, we discuss in detail the problem of distributed reset. A distributed reset operation reinitializes a distributed system to a given global state. Our distributed reset program is the first bounded state program that masks fail-stop and repair of processes and stabilizes in the presence of transient faults. In other words, it guarantees that if only fail-stop and repair faults occur then every reset operation is correct and if transient faults occur then eventually the program reaches a state from where subsequent reset operations are correct. Towards designing this distributed reset program, we have designed multitolerant components that are useful in adding multitolerance to several other applications.

We also argue that the decomposition of a fault-tolerant program into its fault-tolerance components is useful in verification of that program. Towards this end, we present a case study that illustrates our experience in verifying Dijkstra's token ring program with the theorem prover PVS.


Paper:

Slides


Return to the publication list
Return to the Sandeep's home page