> Raft seems a lot more understandable and simpler than alternatives (Paxos,
> Viewstamped Replication) mentioned in the paper, with fewer possible states to
> consider. What are the tradeoffs Raft pays for this simplicity?

Raft gives up some performance in return for clarity; for example:

* Every operation must be written to disk for persistence; performance
  probably requires batching many operations into each disk write.

* There can only usefully be a single AppendEntries in flight from the
  leader to each follower: followers reject out-of-order AppendEntries,
  and the sender's nextIndex[] mechanism requires one-at-a-time. A
  provision for pipelining many AppendEntries would be better.

* The snapshotting design is only practical for relatively small states,
  since it writes the entire state to disk. If the state is big (e.g. if
  it's a big database), you'd want a way to write just the parts of the
  state that have changed recently.

* Similarly, bringing recovering replicas up to date by sending them a
  complete snapshot will be slow, needlessly so if the replica already
  has an old snapshot.

* Servers may not be able to take much advantage of multi-core because
  operations must be executed one at a time (in log order).

All of these could be fixed by modifying Raft, but the result would
likely have less value as a tutorial introduction.

> Is Raft used in real-world software, or do companies generally roll
> their own flavor of Paxos (or use a different consensus protocol)?

Raft is pretty new, so there hasn't been much time for people to design
new systems based on it. Most of the state-machine replication systems I
hear about are based on the older Multi-Paxos and Viewstamped
Replication protocols.

> How did the Raft designers design the system to be as efficient as Paxos
> while maintaining learnability? Were the hard-to-learn portions of
> Paxos simply a result of semantics?

There is a protocol called Paxos that allows a set of servers to agree
on a single value. While Paxos requires some thought to understand, it
is far simpler than Raft. But Paxos solves a much smaller problem than
Raft. Here's an easy-to-read paper about Paxos:

  http://css.csail.mit.edu/6.824/2014/papers/paxos-simple.pdf

To build a real-world replicated service, the replicas need to agree on
an indefinite sequence of values (the client commands), and they need
ways to efficiently recover when servers crash and restart or miss
messages. People have built such systems with Paxos as the starting
point; look up Google's Chubby and Paxos Made Live papers, and
ZooKeeper/ZAB. There is also a protocol called Viewstamped Replication;
it's a good design, and similar to Raft, but the paper about it is hard
to understand.

These real-world protocols are complex, and (before Raft) there was not
a good introductory paper describing how they work. The Raft paper, in
contrast, is relatively easy to read. That's a big contribution.

Whether the Raft protocol is inherently easier to understand is not
clear. The issue is clouded by a lack of good descriptions of other
real-world protocols. In addition, Raft sacrifices performance for
clarity in a number of ways; that's fine for a tutorial, but not always
desirable in a real-world protocol.

> How long had Paxos existed before the authors created Raft? How
> widespread is Raft's usage in production now?

Paxos was invented in the late 1980s. Raft was developed in roughly
2012, over 20 years after Paxos. Raft closely resembles a protocol
called Viewstamped Replication, originally published in 1988.
There were replicated fault-tolerant file servers built on top of
Viewstamped Replication in the early 1990s, though not in production
use.

I know of a bunch of real-world systems that are derived from Paxos:
Chubby, Spanner, Megastore, and ZooKeeper/ZAB. Starting in the early or
mid-2000s, big web sites and cloud providers needed fault-tolerant
services, and Paxos was more or less re-discovered at that time and put
into wide production use.

There are several real-world users of Raft: Docker
(https://docs.docker.com/engine/swarm/raft/) and etcd (https://etcd.io).
Other systems said to be using Raft include CockroachDB, RethinkDB, and
TiKV, but doubtless there are more. Maybe you can find others starting
at http://raft.github.io/

> How does Raft's performance compare to Paxos in real-world applications?

The fastest Paxos-derived protocols are probably much faster than Raft;
have a look at ZAB/ZooKeeper and Paxos Made Live. But I suspect that, at
the expense of making it more complex, one could evolve Raft to have
very good performance. etcd3 claims to have achieved better performance
than ZooKeeper and many Paxos-based implementations
(https://www.youtube.com/watch?v=hQigKX0MxPw).

> Is it becoming more popular to design algorithms for
> understandability, or is Raft a special case?

It's rare to see a research paper published where the main contribution
is clarity.

> How did the authors come up with such a system with these precise
> rules? Did they start with a simple one, and then gradually refine it,
> adding rules to address violations of properties as they came up? Or
> did they start with some overarching model for how such systems should
> work (and then refine it iteratively)?

I don't know. Raft's design is similar to existing practice, going back
to the Viewstamped Replication protocol published in 1988. Randomized
leader election dates back to at least the early 1970s. So there was a
long tradition of design and engineering for them to draw on if they
wanted to. For the details, perhaps they had some kind of iterative
process in which they cycled through design, implementation or
simulation, debugging of problems, and then changes to the design. Or
perhaps they drove the refinement by trying to prove correctness rather
than by testing.

> Why are we learning/implementing Raft instead of Paxos?

We're using Raft in 6.824 because there is a paper that clearly
describes how to build a complete replicated server system using Raft. I
know of no satisfactory paper that describes how to build a complete
replicated server system based on Paxos.

> The paper says that Paxos implementations for real problems have to be
> changed significantly from the correctness-proved version of Paxos,
> but that this is less true for Raft because Raft is easier to
> understand and therefore easier to apply to real distributed consensus
> applications. Presumably, some changes still need to be made when Raft
> is applied in real systems. What kinds of changes are those? Are any
> of the rules ever relaxed, maybe because simplicity is worth
> sacrificing correctness in rare border cases? Or are changes to the
> architecture made? I'm also not sure what kinds of changes
> the paper was talking about when it said significant changes have to
> be made to Paxos in real applications, and I'm curious how
> that parallels Raft.

In the case of Paxos, the original Paxos was not a replicated state
machine system; it only performed a single agreement. So everyone who
used it as the basis for replication had to invent a way to use it to
agree to a sequence of client commands. It turns out that if you want
efficiency, you need a leader as well, so most serious uses of Paxos
also had to add algorithms to elect and use a leader. Similarly for
efficient schemes to bring lagging replicas up to date, to checkpoint
and restore service state, to serve read-only operations without having
to add log entries, to change the set of servers in the replica group,
and to detect re-sent client requests and execute them only once. Raft
already has all of these features.

Perhaps there are things that real-world uses still need to change about
Raft, but likely not nearly as many as for Paxos-based systems.
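As an aside on the last of those features -- detecting re-sent client
requests and executing them only once -- here's a minimal sketch of the
usual approach, assuming a hypothetical key/value service whose clients
tag each request with a ClientId and a monotonically increasing Seq.
The names (Op, DupTable, AlreadyExecuted) are illustrative, not from the
paper or the lab handout:

  package kvexample

  import "sync"

  // Op is a hypothetical client command as it would appear in the log.
  type Op struct {
      ClientId int64
      Seq      int64 // per-client, monotonically increasing
      Key      string
      Value    string
  }

  // DupTable remembers, per client, the highest Seq already executed, so
  // a re-sent request (same ClientId and Seq) is executed at most once.
  type DupTable struct {
      mu      sync.Mutex
      lastSeq map[int64]int64
  }

  func NewDupTable() *DupTable {
      return &DupTable{lastSeq: make(map[int64]int64)}
  }

  // AlreadyExecuted reports whether op is a duplicate; if not, it records
  // op as executed. The service would call this as it applies each
  // committed log entry.
  func (d *DupTable) AlreadyExecuted(op Op) bool {
      d.mu.Lock()
      defer d.mu.Unlock()
      if seq, ok := d.lastSeq[op.ClientId]; ok && op.Seq <= seq {
          return true
      }
      d.lastSeq[op.ClientId] = op.Seq
      return false
  }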
> What are the big differences/advantages/disadvantages between the two?

Raft's design is pretty close to many designs based on Paxos. There are
a couple of ways in which Raft sacrifices performance to gain
simplicity. Raft doesn't pipeline/overlap successive operations. It
doesn't have particularly efficient ways to bring lagging followers up
to date quickly, or to persist Raft state to disk. Raft's snapshotting
is OK for tiny states but very slow for a service with a large state. It
is too bad the Raft paper doesn't include a serious performance
evaluation that would help us understand how much all this matters for
performance.

> 1) Are there any good resources to gain a quick idea of what Paxos is?
> It seems like that would be useful as context for the Raft system.

Here's an introduction:

  http://css.csail.mit.edu/6.824/2014/papers/paxos-simple.pdf

And here's a recorded lecture about Paxos extended to be comparable to
Raft:

  https://www.youtube.com/watch?v=JEpsBg0AO6o

> 2) Are there systems like Raft that can survive and continue to
> operate (commit new log entries) when only a minority of the
> cluster is active? If so, how, and if not, is it provably
> impossible to have such a system?

Not with Raft's properties. But you can do it with different
assumptions, or different client-visible semantics. The basic problem is
split brain -- the possibility of more than one server thinking it is
the leader. There are two approaches that I know of.

If somehow clients and servers can learn exactly which servers are live
and which are dead (as opposed to live but partitioned by network
failure), then one can build a system that can function as long as one
server is alive, picking (say) the lowest-numbered server known to be
alive. However, it's hard for one computer to decide if another computer
is dead, as opposed to the network losing the messages between them. One
way to do it is to have a human decide -- the human can inspect each
server and decide which are alive and dead. This is probably how GFS
decides when the master is dead and it should switch to a backup.

The other approach is to allow split-brain operation, and to have a way
for servers to reconcile the resulting diverging state after partitions
are healed. This can be made to work for some kinds of services, but it
has complex client-visible semantics (usually called "eventual
consistency"). Have a look at the Bayou and Dynamo papers, which are
assigned later in the course.

> One of the invariants that is implicitly assumed in Raft is that the
> total number of servers is constant. This poses an extra requirement
> that at least half of these servers must be alive for the system to
> function. Is there a way to get rid of this requirement?
The problem here is again split brain -- the possibility of more than
one server thinking it is the leader. There are two approaches that I
know of.

If somehow clients and servers can learn exactly which servers are live
and which are dead (as opposed to live but partitioned by network
failure), then one can build a system that can function as long as one
server is alive, picking (say) the lowest-numbered server known to be
alive. However, it's hard for one computer to decide if another computer
is dead, as opposed to the network losing the messages between them. One
way to do it is to have a human decide -- the human can inspect each
server and decide which are alive and dead. This is probably how GFS
decides when the master is dead and it should switch to a backup.

The other approach is to allow split-brain operation, and to have a way
for servers to reconcile the resulting diverging state after partitions
are healed. This can be made to work for some kinds of services, but it
has complex client-visible semantics (usually called "eventual
consistency"). Have a look at the Bayou and Dynamo papers, which are
assigned later in the course.

> In Raft, the service which is being replicated is
> not available to the clients during an election process.
> In practice how much of a problem does this cause?

The client-visible pause seems likely to be on the order of a tenth of a
second. The authors expect failures (and thus elections) to be rare,
since they only happen if machines or the network fail. Many servers and
networks stay up continuously for months or even years at a time, so
this doesn't seem like a huge problem for many applications.

> Are there other consensus systems that don't have any downtime?

There are versions of Paxos-based replication that do not have a leader
or elections, and thus don't suffer from pauses during elections.
Instead, any server can effectively act as leader at any time. The cost
of not having a leader is that more messages are required for each
agreement.

> How are Raft and VMware FT related? What are the major differences and their
> implications?

VM-FT can replicate any virtual machine guest, and thus any server-style
software. Raft can only replicate software designed specifically to work
with Raft. On the other hand, VM-FT punts a critical decision to the
test-and-set feature of the shared disk -- if the shared disk fails,
VM-FT won't work. Raft would be a good way to build a fault-tolerant
shared disk.

> Why would I use VM-FT instead of Raft?

If you had some existing software you wanted to replicate but didn't
want to go to the effort of making it work with Raft.

> What is the best way to protect the Raft system against having a
> machine taken over by adversaries, becoming a leader, and then
> damaging the stored information?

You're right that Raft doesn't have any defense against attacks like
this. It assumes that all participants are following the protocol, and
that only the correct set of servers is participating.

A real deployment would have to do better than this. The most
straightforward option is to place the servers behind a firewall to
filter out packets from random people on the Internet, and to ensure
that all computers and people inside the firewall are trustworthy.

There may be situations where Raft has to operate on the same network as
potential attackers. In that case a good plan would be to authenticate
the Raft packets with some cryptographic scheme. For example, give each
legitimate Raft server a public/private key pair, have it sign all the
packets it sends, give each server a list of the public keys of
legitimate Raft servers, and have the servers ignore packets that aren't
signed by a key on that list.
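Here's a minimal runnable sketch of that idea using Go's standard
crypto/ed25519 package. The packet contents and key-distribution step
are made up for illustration; Raft itself and the labs say nothing about
signing:

  package main

  import (
      "crypto/ed25519"
      "crypto/rand"
      "fmt"
  )

  func main() {
      // Each legitimate Raft server would generate a key pair; the public
      // keys of all legitimate servers would be distributed ahead of time.
      pub, priv, err := ed25519.GenerateKey(rand.Reader)
      if err != nil {
          panic(err)
      }
      trusted := []ed25519.PublicKey{pub} // legitimate servers' public keys

      // Sender: sign the serialized RPC bytes before sending.
      packet := []byte("AppendEntries term=3 leaderId=1 ...") // placeholder payload
      sig := ed25519.Sign(priv, packet)

      // Receiver: drop any packet not signed by a key on the trusted list.
      accepted := false
      for _, k := range trusted {
          if ed25519.Verify(k, packet, sig) {
              accepted = true
              break
          }
      }
      fmt.Println("packet accepted:", accepted)
  }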
> The paper mentions that Raft works under all non-Byzantine conditions.
> What are Byzantine conditions and why could they make Raft fail?

The paper's "non-Byzantine conditions" refers to situations where the
server software and hardware either work correctly or simply stop
executing, and where the network may lose, delay, or re-order messages.
The paper says that Raft is safe under these conditions. For example,
most power failures are non-Byzantine because they cause computers to
simply stop executing instructions; if a power failure occurs, Raft may
stop operating, but it won't send incorrect results to clients.

Any failure other than those mentioned above is called a Byzantine
failure. For example, the Raft software on a server might have bugs that
cause it to not follow the Raft protocol correctly. Or the CPU might
execute instructions incorrectly (e.g. add 100 plus 100 and get 201). Or
attackers might break into one of the servers and modify its Raft
software or stored state, or forge RPCs from/to Raft servers. If a
failure like this occurs, Raft may send incorrect results to clients.

Most of 6.824 is about tolerating non-Byzantine faults. Correct
operation despite Byzantine faults is more difficult; we'll touch on
this topic at the end of the term.

> In figure 1, what does the interface between client and server look
> like?

Typically an RPC interface to the server. For a key/value storage server
such as you'll build in Lab 3, it's Put(key,value) and Get(key) RPCs.
The RPCs are handled by a key/value module in the server, which calls
Raft.Start() to ask Raft to put a client RPC in the log, and reads the
applyCh to learn of newly committed log entries.

> Will the client send the same request more than once in case
> there is a failure?

Yes.

> Will the client send multiple requests in an async way?

It depends on whether the software supports that. For the lab system,
no. For a more industrial-strength system, maybe yes.

> One thing I don't quite understand: suppose the leader receives a
> request from a client, and then that server crashes. Another server is
> elected as the new leader. Why would the new leader receive the
> same request again?

If the client doesn't get a response to its original request, its RPC
will time out, and it will re-send the request to a different server.

> I'm still kind of confused about the safety guarantees of Raft. If
> some server crashes right after receiving a request, before it commits
> it, and after another term has passed the server restarts, this server
> has a log that is not up to date compared with the other servers, so it
> cannot be the leader. And data cannot flow from follower to leader, so
> the new leader will never know about this request, and this request is
> permanently lost. Is that the case?

If a log entry isn't committed, Raft may not preserve it across a leader
change. That's OK because the client could not have received a reply to
its request if Raft didn't commit the request. The client will know (by
seeing a timeout or leader change) that its request wasn't served, and
will re-send it.

> How would Raft handle the split brain problem? That is, two groups of
> servers are disconnected from each other.
> Would each group elect a leader, and would there be two leaders in
> total?

There can be at most one active leader. A new leader can only be elected
if it can contact a majority of servers (including itself) with
RequestVote RPCs. So if there's a partition, and one of the partitions
contains a majority of the servers, that one partition can elect a new
leader. Other partitions must have only a minority, so they cannot elect
a leader. If there is no majority partition, there will be no leader
(until someone repairs the network partition).

Suppose a new leader is elected while the network is partitioned. What
about the previous leader -- how will it know to stop committing new
entries? The previous leader will either not be able to get a majority
of true responses to its AppendEntries RPCs (if it's in a minority
partition), or, if it can talk to a majority, that majority must overlap
with the new leader's majority, and the servers in the overlap will tell
the old leader that there's a higher term. That will cause the old
leader to switch to follower.

> I have a question similar to the one for the reading. Suppose we have a
> situation similar to Figure 7 in the paper, but we also had a network
> partition such that the two halves of the seven-node network would have three
> servers each. Because neither partition can obtain a majority, no one can be
> elected leader. Assuming the 7th server never comes back online (or at least
> it takes a very long time), is there a way for Raft to change the number of
> servers in the network, or at the very least notify a sysadmin that no
> progress is being made?

Raft cannot do anything without a majority. The reason is that if a
minority of servers were allowed to do anything, then two minorities in
different partitions could do that thing. We would not want the two
minorities in two different partitions to both decide to reduce the
number of servers, since then they would be able to operate without
knowing about each other, causing split brain.

A sysadmin could easily notice that the Raft-based service was no longer
responding. Perhaps the sysadmin could pick one of the minority
partitions, hopefully the most up-to-date one, and allow it to continue
by telling it to reduce the total set of servers. Raft doesn't support
this, but you could imagine modifying it to do so. However, the sysadmin
would have to make a complex decision here, and it would be error-prone.
It would be better for the sysadmin to fix the network partition,
allowing Raft to continue as designed.

> In figure 7, it seems that because f will never be elected leader, the logs
> that only it has, i.e. all the logs from term 2, will be forever lost because
> that server will have to overwrite them. Is this correct, and if so, is this
> not a concern of Raft?

That's correct. The log entries in question could not have been
committed (they are not on a majority of servers), so the client could
not have received a reply to the requests. The client will eventually
notice a change in leader, or time out its RPCs, and re-send the
requests to the new leader.

> What prevents a leader from ever overwriting or deleting any of its logs?

The leader follows the rules in Figure 2. Nothing in Figure 2 will cause
a leader to delete anything from its own log.

> How exactly does a candidate become eligible to receive votes from
> other servers? How do follower servers know when and who to vote for?

Candidates send RequestVote RPCs to all the servers. Each server votes
yes or no for each RequestVote it receives, according to the algorithm
under RequestVote RPC in Figure 2.
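Here's a rough sketch of that vote decision in Go. The field and type
names loosely follow Figure 2 and the lab skeleton's conventions, but
this is an assumption-laden sketch, not a drop-in handler -- for
example, it omits persisting currentTerm/votedFor and resetting the
election timer:

  package raft

  import "sync"

  type LogEntry struct{ Term int }

  type RequestVoteArgs struct {
      Term, CandidateId, LastLogIndex, LastLogTerm int
  }

  type RequestVoteReply struct {
      Term        int
      VoteGranted bool
  }

  type Raft struct {
      mu          sync.Mutex
      currentTerm int
      votedFor    int        // -1 means "haven't voted this term"
      log         []LogEntry // entry 0 is a sentinel with Term 0
  }

  // upToDate implements the Section 5.4.1 election restriction: is the
  // candidate's last entry at least as up-to-date as ours?
  func (rf *Raft) upToDate(lastTerm, lastIndex int) bool {
      myLastIndex := len(rf.log) - 1
      myLastTerm := rf.log[myLastIndex].Term
      if lastTerm != myLastTerm {
          return lastTerm > myLastTerm
      }
      return lastIndex >= myLastIndex
  }

  func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
      rf.mu.Lock()
      defer rf.mu.Unlock()
      if args.Term > rf.currentTerm { // Rules for Servers: adopt higher terms
          rf.currentTerm = args.Term
          rf.votedFor = -1
      }
      reply.Term = rf.currentTerm
      if args.Term < rf.currentTerm { // stale candidate
          return
      }
      if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
          rf.upToDate(args.LastLogTerm, args.LastLogIndex) {
          rf.votedFor = args.CandidateId
          reply.VoteGranted = true
      }
  }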
> I understand that it's on a first come first serve basis, but do
> candidates announce their candidacy to other servers to receive votes?

A candidate sends out RequestVote RPCs; this implicitly announces its
candidacy.

> How reliable is the safety assumption that electionTimeout will be
> an order of magnitude greater than broadcastTime, and an order of
> magnitude less than MTBF?

A bad choice of election timeout does not affect safety, it only affects
liveness.

Somebody (the programmer?) has to choose the election timeout. The
programmer could easily choose a value that's too large or too small. If
too small, then followers may repeatedly time out before the leader has
a chance to send out any AppendEntries. In that case Raft may spend all
its time electing new leaders, and no time processing client requests.
If the election timeout is too large, then there will be a needlessly
large pause after a leader failure before a new leader is elected.

> Could you explain more about how Raft recovers from a network
> partition, e.g. when there is a 3-way partition where none of the
> partitions contains a majority of servers?

In such a partition, Raft won't be able to elect any leader, and thus
won't serve any client requests. The Raft servers will try again and
again to elect a leader, never succeeding. If someone fixes the network
so that the partition goes away, and a majority of servers can talk to
each other, then an election will succeed, and Raft will start being
able to serve client requests again.

> Why are retry timeouts to become a candidate randomized? Is it to
> avoid a case in which two competing candidates start a term at the
> same time and continue to do so, therefore blocking an election?

That's exactly right.

> In the election restriction part, how do we define the "majority of the
> cluster" that a candidate needs to contact?

By "majority", the paper means more than half of the total set of
servers.

> If it's more than half, then for the paper question, server (a) might
> also be elected

I agree.

> however, it does not contain all committed entries (term 7).

The entries in term 7 cannot have been committed because they are not
present on a majority of the servers. A leader only commits a log entry
if it receives positive replies to its AppendEntries from a majority of
the servers.

> If a network partition is never resolved, can the two sides of the
> partition realize that the number of nodes needed for a majority is different
> and elect new leaders at that point?

I think you are suggesting that there be two leaders, each with its own
smaller majority. This can't be allowed, because it would violate the
goal that the replicated service look to clients as if it were an
ordinary non-replicated service. That is, it would produce a "split
brain".

Suppose the service is a key/value database. If clients A and B perform
the following sequence of puts/gets, one at a time, we'd expect the
final get("key1") to yield "value2".

  Client A: put("key1", "value1")
  Client B: put("key1", "value2")
  Client A: get("key1") -> ???

However, if there are two leaders that both think they have majorities,
and client A talks to one leader and client B talks to the other, then
client A won't see client B's put value, and the last get will yield
"value1". That's not legal, since it could not happen in a
non-replicated single-server setup.
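To make the "what a correct service must return" expectation concrete,
here's a tiny runnable sketch of the same sequence against a single
in-memory server. The KV interface is a made-up stand-in for a Lab
3-style client, not real lab code:

  package main

  import "fmt"

  // KV is a stand-in for a replicated key/value service's client API.
  type KV interface {
      Put(key, value string)
      Get(key string) string
  }

  // singleServer is a trivial in-memory store, i.e. the behavior clients
  // are entitled to expect from a correct (non-split-brain) service.
  type singleServer map[string]string

  func (s singleServer) Put(k, v string)     { s[k] = v }
  func (s singleServer) Get(k string) string { return s[k] }

  func main() {
      var kv KV = singleServer{}
      kv.Put("key1", "value1")    // client A
      kv.Put("key1", "value2")    // client B
      fmt.Println(kv.Get("key1")) // client A sees "value2"
      // If a partition produced two "leaders" and A's and B's puts went to
      // different ones, A's final Get could return "value1" instead -- an
      // outcome no single correct server could produce.
  }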
> Step 6 in section 5.4.3 of the safety proof is a little unclear to me.
> If leader_u shared the same last term as the voter's, why does it have
> to have its log at least as long as the voter's?

Step 2 says the voter has the committed log entry in question, and that
the voter was part of the majority that voted for leader_u. The election
restriction rule (Section 5.4.1) says that if the voter's and
candidate's last log terms are equal, the voter will only vote "yes" if
the candidate's log is at least as long as the voter's. Since each term
has only one leader, and that leader sends out only one sequence of log
entries for that term, the fact that the leader's log is at least as
long as the voter's and has the same last term implies that the leader's
log contains all the entries in the voter's log for that term --
including the committed entry in question.

(f) in Figure 7 is a different situation, because (f) has a different
last term in its log than the new leader.

> If a new server is added, how is the current state of the rest of the
> machines safely replicated on to this new server?

The mechanism turns out to be the same one that brings lagging followers
up to date with the leader. When the new server joins, it will reject
the first AppendEntries it receives from the leader, because the new
server's log is empty (step 2 under AppendEntries RPC in Figure 2). Then
the leader will reduce its nextIndex[i] for the new server until it
reaches the beginning of the log; at that point the AppendEntries'
prevLogIndex is 0, which matches the new server's empty log, and the
leader will send the new server the entire log with AppendEntries.

> What exactly do the C_old and C_new variables in section 6 (and figure
> 11) represent? Are they the leader in each configuration?

They are the set of servers in the old/new configuration.

> Is there a way to resolve multiple leader servers going down
> sequentially, and deadlocking the entire system? Or is it just
> probabilistically unlikely that this would happen, so such scenarios
> are effectively impossible?

Is the case you have in mind that a majority of the servers die, so that
an election can't succeed? Indeed, Raft won't make any progress in this
situation. If this happens a lot, the only way to really solve it is to
get more reliable servers.

> In figure 2 the authors say that the rules "trigger independently and repeatedly".
> Is this the same thing as asynchronously?

In a real implementation you'd likely check for the rules at points in
your code (e.g. RPC handlers) that change the relevant state. You'll
also need one or more independent threads that execute things like
timeouts that aren't in response to a specific change in Raft's state.

> If so, wouldn't execution be serialized
> around the lock to raft's state?

Yes.

> Could the implementation of raft be further simplified by setting up a
> concrete scheduling loop for events?
>
> Sort of like:
>
>   func Schedule() {
>     // events are polled for instead of interrupting the state machine
>     for {
>       AppendEntries ?      -> do append()  // state changes happen inside these functions
>       AppendEntriesReply ? -> do append_reply()
>       ElectionTimeout ?    -> do start_election()
>       VoteRequest ?        -> do vote_response()
>       VoteRequestReply ?   -> do process_vote_reply()
>       ... // you get it
>     }
>   }
>
> I imagine that even if all the events triggered asynchronously, then it is possible
> for an event schedule like the one above to happen. The question is whether all
> possible event schedules can happen with asynchronous triggers.

This seems reasonable. Presumably the RPC handlers would send a message
on the AppendEntries channel &c?

In my implementation I basically directly call do append_reply() from
the RPC handler, and don't have anything equivalent to Schedule(). I
have a separate goroutine for election timeouts, and a separate
goroutine to decide when log entries have committed and send them on the
applyCh.
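For instance, the election-timeout goroutine might look roughly like the
sketch below. The field names (state, lastHeard, electionTimeout) and
the startElection helper are assumptions for illustration; the lab
skeleton doesn't dictate these names, and the election logic itself is
omitted:

  package raft

  import (
      "sync"
      "time"
  )

  type serverState int

  const (
      follower serverState = iota
      candidate
      leader
  )

  type Raft struct {
      mu              sync.Mutex
      state           serverState
      lastHeard       time.Time     // last heard from a valid leader, or granted a vote
      electionTimeout time.Duration // randomized, re-drawn when the timer is reset
  }

  func (rf *Raft) startElection() {
      // placeholder: increment currentTerm, vote for itself, and send
      // RequestVote RPCs to all the other servers (omitted here).
  }

  // ticker runs in its own goroutine for the life of the Raft peer. It
  // periodically checks whether the election timeout has expired, and if
  // so kicks off an election.
  func (rf *Raft) ticker() {
      for {
          time.Sleep(10 * time.Millisecond) // poll granularity
          rf.mu.Lock()
          expired := rf.state != leader &&
              time.Since(rf.lastHeard) >= rf.electionTimeout
          rf.mu.Unlock()
          if expired {
              rf.startElection()
          }
      }
  }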
> How does Raft handle new servers joining the cluster?

Have a look at the paper's Section 6.

> Can a candidate declare itself the leader as soon as it receives a
> majority of votes and discard the remaining votes?

Yes -- a majority is sufficient.

> Or would it create problems because it might learn of another server
> with a higher term in one of the remaining votes?

It could indeed learn of a leader with a higher term, but the protocol
deals correctly with this. Any such new leader would itself have
collected a vote from a majority, and that majority would know of the
new higher term number. The leader with the lower term can only modify
the log if a majority accept its AppendEntries RPCs -- but they won't,
since at least one server out of any majority will know of the higher
term of the newer leader.

> Do leaders in Raft ever stop being a leader other than by crashing?

Yes. If a leader's network connection breaks, or loses too many packets,
the other servers won't see its AppendEntries RPCs, and will start an
election. Similarly if the leader or network is too slow.

> If not, would this possibly cause problems with load balancing if we
> had multiple disjoint services on top of the same cloud and all of the
> leaders happened to converge on one server?

You'd need a mechanism outside of Raft to manage this. But that
mechanism could easily enough ask a Raft leader to step down.

> How do log entries get handled during elections? With no clear leader,
> how do none of the log entries get lost?

During an election the servers don't do anything to their logs, so
nothing is lost. After a leader is elected, the leader may ask followers
to roll back their logs so that they become identical to the leader's
log. But the algorithm for this guarantees never to discard a committed
log entry, so nothing valuable will be lost.

> When are followers' log entries sent to their state machines? A
> follower may have a given set of log entries, but they may be changed
> when a new leader emerges and there is inconsistency.

Only after the leader says that an entry is committed. Raft never
changes its mind about committed log entries.

> The paper states that the leader never responds to a client until
> after the entry is applied to the state machine. How long does this
> typically take? Does this delay in latency create any problems?

It depends on the service. If it's a database that stores its data on
disk, and the client asked to write the data, the write might take a few
tens of milliseconds. But that's more or less the same delay as the
service would impose without Raft, so presumably it's OK.

> If elections happen relatively often, then the server will keep being
> unavailable for tiny gaps of time pretty often. Does that performance
> hit really go unnoticed by clients?

Clients will notice -- they will see delays, and have to re-send
requests if they originally sent them to a failed leader. But elections
should only happen when servers fail or are disconnected. Without Raft
(or something like it), clients wouldn't get any service at all when
there are failures. Raft makes things strictly better by continuing
despite some kinds of failures, though indeed after an election delay.
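To make the "only after commit" point above concrete: the goroutine that
delivers entries to the state machine typically watches commitIndex and
sends each newly committed entry on applyCh, along the lines of the
sketch below. The ApplyMsg fields and Raft field names mirror Figure 2
and the lab skeleton, but the structure (a polling loop rather than a
condition variable) is just one possible choice, not the required one:

  package raft

  import (
      "sync"
      "time"
  )

  type ApplyMsg struct {
      CommandValid bool
      Command      interface{}
      CommandIndex int
  }

  type LogEntry struct {
      Term    int
      Command interface{}
  }

  type Raft struct {
      mu          sync.Mutex
      log         []LogEntry // entry 0 is a sentinel, so log index == slice index
      commitIndex int        // highest index known to be committed
      lastApplied int        // highest index handed to the state machine
      applyCh     chan ApplyMsg
  }

  // applier runs in its own goroutine and delivers committed entries, in
  // log order, exactly once each, to the service via applyCh. It never
  // sends an entry beyond commitIndex, so nothing uncommitted is applied.
  func (rf *Raft) applier() {
      for {
          rf.mu.Lock()
          var msgs []ApplyMsg
          for rf.lastApplied < rf.commitIndex {
              rf.lastApplied++
              msgs = append(msgs, ApplyMsg{
                  CommandValid: true,
                  Command:      rf.log[rf.lastApplied].Command,
                  CommandIndex: rf.lastApplied,
              })
          }
          rf.mu.Unlock()
          for _, m := range msgs {
              rf.applyCh <- m // send without holding the lock
          }
          time.Sleep(10 * time.Millisecond)
      }
  }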
> One thing that I was unclear on when reading and working on the lab:
> when a leader sends an AppendEntries RPC, does it wait for every RPC
> to finish (either successfully or not), simply wait for a majority,
> or neither?

The leader shouldn't really wait. Instead, it should send the
AppendEntries RPCs and continue. It should also look at each reply to
its RPCs for a particular log entry, count these replies, and mark the
log entry as committed only when it has replies from a majority of
servers (including itself).

One way to do this in Go is for the leader to send each AppendEntries
RPC in a separate goroutine, so that the leader sends the RPCs
concurrently. Something like this:

  for each server {
    go func() {
      send the AppendEntries RPC and wait for the reply
      if reply.success == true {
        increment count   // under a mutex, since many goroutines update it
        if count == nservers/2 + 1 {
          this entry is committed
        }
      }
    }()
  }

> What are the cases where a server may have more entries than the
> leader, like the server (d) in Figure 7?

(d) could have been the leader previously, for term 7, as you point out.
It's also possible that some other server (?) was leader for term 7,
that only (d) heard new log entries from (?), and that after a leader
change the new leader caused (?) to roll back its uncommitted entries
for term 7 but didn't get around to asking (d) to roll back.

> Also, another question: is it true that in a leader's log, there
> will be entries that are just replicated to the other servers but not
> committed?

Yes.

> And this is fine because the client will not receive any output for
> these uncommitted commands?

Yes.

> And this only happens when a leader is elected, and it has entries
> from its old term that haven't been replicated to all the other
> servers?

The leader can also have uncommitted entries in its log because it
hasn't heard AppendEntries RPC replies from a majority yet.

> I'm a little confused with regards to this question: it
> seems the committed entries are 111445566, because a majority of the
> servers have these entries. Server (a) has these entries, but only it
> and (b) would vote for it, because (c) and (d) have newer entries and
> wouldn't vote, as I understand it. But shouldn't
> it technically be allowed to win, since it has all the committed
> entries?

I think server (a) can get a majority, since (a), (b), (e), and (f) will
vote for it.

> What happens if more than half of the servers die? (I also asked this
> on Piazza)

The service can't make any progress; it will keep trying to elect a
leader over and over. If/when enough servers come back to life with
persistent Raft state intact, they will be able to elect a leader and
continue.

> Raft does ensure consistency eventually, but aren't there lots of
> practical examples where discrepancies between the logs can cause
> useful data to be erased?

The only log entries that will be erased correspond to client requests
where the service never replied to the client (since the request never
committed). Such clients will re-send their requests to a different
server until they find the current Raft leader, which (if there are no
more failures) will commit the requests into the log, execute them, and
reply to the client. So in the end nothing will be lost.

> When sending a RequestVote message to another server, if the sent
> message has a higher term than the term of the receiving server, does
> the receiving server increase its term to match that of the sent
> message?

Yes. You can see this near the top of Rules for Servers in Figure 2.
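That rule applies to every RPC request and reply, not just RequestVote.
Here's a tiny sketch of one way an implementation might centralize it;
the maybeStepDown name and the state field are inventions for
illustration, and a real implementation would also persist currentTerm
and votedFor at this point:

  package raft

  import "sync"

  type state int

  const (
      followerState state = iota
      candidateState
      leaderState
  )

  type Raft struct {
      mu          sync.Mutex
      currentTerm int
      votedFor    int
      state       state
  }

  // maybeStepDown implements the Figure 2 rule "if RPC request or
  // response contains term T > currentTerm: set currentTerm = T, convert
  // to follower". The caller must hold rf.mu.
  func (rf *Raft) maybeStepDown(term int) {
      if term > rf.currentTerm {
          rf.currentTerm = term
          rf.votedFor = -1 // no vote cast yet in the new term
          rf.state = followerState
      }
  }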
> Why do they choose to one-index the log? Would it be problematic if
> the log were zero-indexed instead?

It's just a convenience. I suspect the authors' implementation actually
has an entry 0 in the log array, with Term=0. That allows the very first
AppendEntries RPC to contain 0 as PrevLogIndex, and be a valid index
into the log. I think it would be easy to modify the protocol to have a
zero-indexed log, and change the various places that use 0 to mean
"before the first log entry" to use -1 instead.

> Would another alternative be to not allow a single server to be the leader for too many
> terms? This would prevent certain servers from being the leader for too long
> in case one is malicious.

I'd worry that if a server is malicious, it can do as much damage as it
likes in a short time.

> In a replicated state machine model, the state machine reads from the
> log on each replicated server. However, if some problem occurs from
> the consensus module when sending commands to the logs, couldn't this
> result in differing states? Why can't we have the state machines from
> the different servers instead read from one log, thereby guaranteeing
> the state will be consistent across all servers? One argument is fault
> tolerance, in which case we can keep 2 or 3 backup logs that replace
> the original if the original is lost for any reason. After reading
> more of the paper, it seems that the point is that by using the
> leader-follower model, we can guarantee consistency to a point between
> servers. However, this seems like a more complicated and more
> error-prone way.

Raft basically does what you suggest, but it turns out to be fairly
complex. Raft keeps one main copy of the log (in the leader), and also
some backup copies of the log (in the followers). It uses the leader's
copy of the data to answer client requests if the leader is alive, and
switches to a backup if something goes wrong. We can't expect the copies
of the log to be identical at all times, since some of the servers may
be crashed or have broken network connections. Much of the complexity in
Raft has to do with maintaining correctness despite log copies that may
not be exactly identical.

> When choosing a leader in a given term, why not settle split votes
> randomly, or give the servers integer ids and choose the server with
> the highest id (similar to the ranking idea, but only in the case of split
> votes)?

You could probably make elections based on these ideas work. The authors
tried at least one approach like this (ranking), but note that it is
more complex. If elections happened very frequently, and were often
split, it might be worth designing a more complex but faster scheme.

> Isn't it a waste to not choose a leader for a term?

Yes; it means that sometimes elections take a few hundred milliseconds
longer than strictly necessary (an extra election timeout). But
elections only happen after failures, which can't be very frequent if
the system is to get any work done. So the fact that elections might
sometimes take longer than necessary is probably not a very significant
factor in overall system performance.

> It is even admitted in the paper that elections could potentially
> happen indefinitely (although unlikely due to randomized election
> timeouts).

This can only happen if the servers repeatedly pick random election
timeouts that happen to be the same for multiple servers. That could
happen, but it's like rolling a die many times and getting the same
number every time -- possible, but the probability gets small rapidly as
you roll more times.
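In an implementation, the randomization is usually just a fresh random
offset on top of a base timeout, re-drawn every time the election timer
is reset. A minimal sketch follows; the 300-600 millisecond range is an
arbitrary illustrative choice, and the labs may need different values to
fit their heartbeat rate and test timeouts:

  package main

  import (
      "fmt"
      "math/rand"
      "time"
  )

  // randomElectionTimeout returns a fresh timeout in [300ms, 600ms). Each
  // server re-draws this every time it resets its election timer, so two
  // servers rarely time out at exactly the same moment.
  func randomElectionTimeout() time.Duration {
      return 300*time.Millisecond +
          time.Duration(rand.Int63n(int64(300*time.Millisecond)))
  }

  func main() {
      for i := 0; i < 3; i++ {
          fmt.Println(randomElectionTimeout())
      }
  }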
> I may be understanding something wrong, but the paper says that Raft
> prevents a server from being elected if it does not have all committed
> entries. However, it seems from the exercise that server "a" could be
> elected without having all the committed entries. What am I missing?

I believe server "a" has all of the entries that could have committed.
Are you thinking about the "6" entries that "c" and "d" have that are
beyond the end of "a"'s log? These cannot have been committed. In order
for the leader who sent those "6" entries out to have committed them,
that leader would have had to get positive replies from a majority
(including itself), which is 4 of the 7 servers. But those "6" entries
are only on three servers, so apparently that leader didn't get a
majority, so it didn't commit them.

> What happens if a leader election is necessary (the current leader
> fails) and there are no followers whose logs are entirely committed --
> what happens in the election?

If a majority of servers is alive and can talk to each other, they will
elect a leader. The elected leader will have a log at least as
up-to-date as any of the servers in its majority (this is the election
restriction in Section 5.4.1). The elected leader may have entries at
the end of its log that are not committed, perhaps because it was the
only server to receive those entries from the previous leader. It will
send those entries to the followers, and the entries will become
committed as soon as the leader manages to commit an entry from the
current term (see Section 5.4.2).

> Is it always true that if a follower contains log entries that the
> current leader does not have, then those log entries will eventually
> be deleted? My understanding is that the leader will try to force the
> follower's log to conform to the leader's log,
> and so the disagreeing entries will be erased.

Yes -- as soon as the new leader commits any new entry, it will tell the
follower with conflicting log entries to delete them and replace them
with the leader's log entries.

> However, what happens to these entries that are erased? Are they not
> important to the client, and does the system not notify the client
> that some actions failed to take place?

The follower will delete the log entries without notifying the client.
The reason this is OK is that the entries could not have been committed,
so whatever leader the client sent the RPCs to could not have replied to
the RPCs, so the client could not have thought the RPCs had succeeded.
The client probably timed out and re-sent the RPCs to a new leader.

> In Figure 7, how does a machine get extra uncommitted logs?

Example: server (c)'s last log entry is uncommitted, since it doesn't
exist on a majority of servers. This could have happened because server
(c) was the leader for term 6, and lost leadership (perhaps crashed)
after sending itself a command for log index 11 but before sending it to
any other server.

> Is it possible that none of the nodes satisfy the election
> restriction, i.e. that none of them have all the committed logs? I
> guess it can't happen, because the 3 properties guarantee that
> followers would only accept a new log entry when all of its previous
> entries are correct, and the latest entry will only be committed when
> there is at least one "good" follower? I'm not sure though.
I believe that at least one server always satisfies the "at least as
up-to-date" rule described at the end of 5.4.1. It's true that two
servers might have the same last term and the same length log, but then
either of them is allowed to be leader.

> Also, when a network partition happens, if the clients send commands to
> different partitions, wouldn't the commands sent to the "smaller" groups be
> lost? I guess in that case the client can send again, but I'm not
> sure.

Yes, only the partition with a majority of servers can commit and
execute client operations. The servers in the minority partition(s)
won't be able to commit client operations, so they won't reply to client
requests. Clients will keep re-sending the requests until they can
contact a majority Raft partition.

> Even with the authors' explanation, I still cannot fully understand
> Figure 7. In that situation, why could server (c) and server (d) have
> extra uncommitted entries?

Server (c)'s last log entry is uncommitted, since it doesn't exist on a
majority of servers. This could have happened because server (c) was the
leader for term 6, and lost leadership (perhaps crashed) after sending
itself a command for log index 11 but before sending it to any other
server.

Server (d) was probably the leader for term 7, and sent itself two
commands (log indices 11 and 12), but crashed before sending the
commands to any other server.

> One more question about "majority":
>
> For example, there are 8 servers. Server A is the elected leader and
> the rest are followers. If something bad happened and 4 followers
> crashed, and they will never send messages back to the leader, there are
> only 3 followers left. I think even if Server A gets all the responses
> from the remaining followers, the number is still less than a
> "majority". In this way, the leader will never get more committed
> entries. What can Raft do in this situation?

At least a majority (5 servers) have to be alive and in communication in
order for Raft to make progress. In this situation, the 4 live servers
will repeatedly attempt to elect a leader, and fail (since there is no
majority). If one of the 4 crashed servers reboots, then there will be a
majority, and Raft will elect a leader and continue.

> Sections 1 to 5 look very straightforward. How do we know the
> argument in 5.4.3 is complete for proving the properties? It seems logical
> but I am not sure if the argument covers all the possible cases.

5.4.3 is not a complete proof. Here are some places to look for
something more complete:

  http://ramcloud.stanford.edu/~ongaro/thesis.pdf
  http://verdi.uwplse.org/raft-proof.pdf