Explain Reaching Agreement.
For a system to be reliable, we need a mechanism that allows a set of processes to agree on a common value. Such an agreement may not take place, for several reasons. First, the communication medium may be faulty, resulting in lost or garbled messages. Second, the processes themselves may be faulty, resulting in unpredictable process behavior.
The best we can hope for in this case is that processes fail in a clean way, stopping their execution without deviating from their normal execution pattern. In the worst case, processes may send garbled or incorrect messages to other processes or even collaborate with other failed processes in an attempt to destroy the integrity of the system.
The Byzantine generals problem provides an analogy for this situation. Several divisions of the Byzantine army, each commanded by its own general, surround an enemy camp. The Byzantine generals must reach agreement on whether or not to attack the enemy at dawn. It is crucial that all generals agree, since an attack by only some of the divisions would result in defeat.
The various divisions are geographically dispersed, and the generals can communicate with one another only via messengers who run from camp to camp. The generals may not be able to reach agreement for at least two major reasons:
1. Messengers may get caught by the enemy and thus may be unable to deliver their messages. This situation corresponds to unreliable communication in a computer system and is discussed further in Section 18.7.1.
2. Generals may be traitors, trying to prevent the loyal generals from reaching an agreement. This situation corresponds to faulty processes in a computer system and is discussed further in Section 18.7.2.
Let us first assume that, if processes fail, they do so in a clean way and that the communication medium is unreliable. Suppose that process P,- at site Si, which has sent a message to process P; at site S2, needs to know whether Pj has received the message so that it can decide how to proceed with its computation. For example, P, may decide to compute a function foo if Pj has received its message or to compute a function boo if Pj has not received the message (because of some hardware failure). To detect failures, we can use a time-out scheme similar to the one described in Section 16.7.1.
When P, sends out a message, it also specifies 18.7 Reaching Agreement 687 a time interval during which it is willing to wait for an acknowledgment message from P,. When P, receives the message, it immediately sends an acknowledgment to P,-. If P; receives the acknowledgment message within the specified time interval, it can safely conclude that P, has received its message. If, however, a time-out occurs, then P, needs to retransmit its message and wait for an acknowledgment. This procedure continues until P, either gets the acknowledgment message back or is notified by the system that site En is down. In the first case, it will compute S; in the latter case, it will compute F.
Note that, if these are the only two viable alternatives, P; must wait until it has been notified that one of the situations has occurred.
Suppose now that P< also needs to know that P,- has received its acknowledgment message, so that it can decide how to proceed with its computation. For example, Pj may want to compute foo only if it is assured that P>- got its acknowledgment. In other words, P; and Pj will compute foo if and only if both have agreed on it. It turns out that, in the presence of failure, it is not possible to accomplish this task. More precisely, it is not possible in a distributed environment for processes P; and Pr to agree completely on their respective states.
To prove this claim, let us suppose that a minimal sequence of message transfers exists such that, after the messages have been delivered, both processes agree to compute foo. Let in' be the last message sent by P, to Pj. Since P; does not know whether its message will arrive at Pj (since the message may be lost due to a failure), P; will execute foo regardless of the outcome of the message delivery. Thus, m' could be removed from the sequence without affecting the decision procedure. Hence, the original sequence was not minimal, contradicting our assumption and showing that there is no sequence. The processes can never be sure that both will compute foo.
Now let us assume that the communication medium is reliable but that processes can fail in unpredictable ways. Consider a system of n processes, of which no more than m are faulty. Suppose that each process P; has some private value of V/. We wish to devise an algorithm that allows each nonfaulty process P, to construct a vector X, = (A.i, A.2- •••, A,n) such that the following conditions exist: 1. If Pj is a nonfaulty process, then Aj.j = Vj. 2. If P, and Pj are both nonfaulty processes, then X,- = Xj. There are many sokitions to this problem, and they share the following properties:
1. A correct algorithm can be devised only if n > 3 x m + 1.
2. The worst-case delay for reaching agreement is proportionate to in + 1 message-passing delays.
3. The number of messages required for reaching agreement is large. No single process is trustworthy, so all processes must collect all information and make their own decisions Rather than presenting a general solution, which would be complicated, we present an algorithm for the simple case where m = 1 and n - 4.
The algorithm requires two rounds of information exchange:
1. Each process sends its private value to the other three processes.
2. Each process sends the information it has obtained in the first round to all other processes. A faulty process obviously may refuse to send messages. In this case, a nonfaulty process can choose an arbitrary value and pretend that the value was sent by the faulty process. Once these two rounds are completed, a nonfaulty process P, can construct its vector X, = (A.i, A.2/ A.3/ A.4) as follows:
2. For j ^ i, if at least two of the three values reported for process Pj (in the two rounds of exchange) agree, then the majority value is used to set the value of A,/- Otherwise, a default value—say, nil—is used to set the value of A.,.
Frequently Asked Questions
- Operating System Concepts ( Multi tasking, multi programming, multi-user, Multi-threading )
- Different Types of Operating Systems
- Batch Operating Systems
- Time sharing operating systems
- Distributed Operating Systems
- Network Operating System
- Real Time operating System
- Various Operating system services
- Architectures of Operating System
- Monolithic architecture - operating system
- Layered Architecture of Operating System
- Microkernel Architecture of operating system
- Hybrid Architecture of Operating System
- System Programs and Calls
- Process Management - Process concept