Fault-Tolerance

Previous: Eduction Up: Task Scheduling and Fault-Tolerance Next: Data Structures

Fault-Tolerance

Fault-tolerance is frequently sited as an advantage of distributed systems, while in actuality it's requirement is a downside of the increased number of components in the systems. Data and instructions may be lost or garbled in transmission between processors, and processor nodes may function incorrectly due to a variety of causes. If a resource (processor or communications channel) is generating a large number of errors, it may be identified before program execution and the resource manager will not attempt to utilize the faulty resource. If, on the other hand, an error is detected during execution of a program, the system must take explicit action to recover from the error and prevent the program from failing.

Error Detection

While all computing is subject to failure, different subsystems vary in frequency of faults. The microprocessor core, a large array of memory succeptible to cosmic ray collisions, and a noisy analog communications medium are three systems varying in reliability from excellent to poor. In the case of some subsystems, it is cost effective to perform error detection and recovery internally, in order to improve the reliability of the subsystem to acceptable levels. Examples of subsystems likely to do so are magnetic storage and communications.

It is possible to independently correct many communication errors (e.g. by retransmission or error correcting codes), but some errors, such as complete failure of a communications channel or the node being communicated with (henceforth referred to merely as failure of the node), require higher level error handling. If a faulty subsystem is not capable of perfoming error correction internally, the resource manager is notified.

The mechanism by which the failure of a node assigned to a task is detected relies on a later task (on another node) requiring the data. The consuming task must query (using GET-OBJECT or GET-STREAM-OBJECT) for the result of the producing task in order to force its evaluation. If a certain amount of time has passed since the query with no acknowledgement or result, a second query is issued. If it goes unacknowledged, an error handling routine is invoked. The amount of time waited is specified by the process environment.

If a query has been forwarded, or if a node is capable of executing a task (but hasn't scheduled it yet), it responds to a query with an acknowledge message. This indicates that the additional delay being incurred is not due to processor/network failure.

Processing elements which produce an erroneous output pose a particular problem, since testing their correct operation requires redundant computation of the result and comparison. Instead of performing this continually, I propose a stochastic sampling of the processor element performance, in the form of a monitoring program which occasionally (every several seconds) tests the processing elements for incorrect operation using a redundant method (ideally using results calculated using different machine architectures.) The resource manager would be informed of processors which fail some low threshold of tests, resulting in them immediately being removed from operation -- all data and tasks migrated to alternate processing nodes.

Error Recovery

The price of guaranteeing error detection and recovery (e.g. using redundancy, frequent checkpointing, or atomic operations) is prohibitively high, both in resource utilization and performance. Instead, the MagicEight system guarantees that it will attempt recovery, and should succeed unless the system is ``overloaded'' (memory and/or processing resources are in short supply) or the error is due to software error within the program.

The fault-tolerant nature of functional languages, and eduction in particular, have been noted previously [JA91] [AFJW95]. This is due to the ability of a program to recompute intermediate data values. In the eduction of a graph, if the value of a variable is not found, it is actively demanded. Values are recalculated as necessary to mask the effects of faulty memory, processor or communication resources. Correcting errors at the algorithmic level (i.e. by knowing how to recalculate erroneous data) allows eduction the benefits of fault-tolerance with few of the traditional costs.

Upon detection of an error, the resource manager attempts backwards error recovery [Lap85] first, until it is discovered that a necessary data item is unrecoverable (due to buffer overwrite, for example). If backwards error recovery fails, forward error recovery (in the form of ignoring the missing data) is attempted, and the user is quietly notified. In many cases, where the resulting data is driving output transducers (such as monitors or speakers), the resulting loss will be noticeable, but a high overall system availability may be maintained even in the event of unrecoverable errors.

The logical structure of the processors performing the eduction is such that the catastrophic failure of any single processor or network link should be tolerated. After the detection of such a failure, the logical structure of each process affected is modified to provide tolerance of future faults.

An example of the error recovery algorithm handling a typical error will be added here.

Previous: Eduction Up: Task Scheduling and Fault-Tolerance Next: Data Structures

magiceight-web@media.mit.edu