Raft FAQ

Q: Does Raft sacrifice anything for simplicity?

A: Raft gives up some performance in return for clarity; for example:

* Every operation must be written to disk for persistence; performance probably requires batching many operations into each disk write (see the sketch after this list).

* There can only usefully be a single AppendEntries in flight from the leader to each follower: followers reject out-of-order AppendEntries, and the sender's nextIndex[] mechanism requires one-at-a-time. A provision for pipelining many AppendEntries would be better.

* The snapshotting design is only practical for relatively small states, since it writes the entire state to disk. If the state is big (e.g. if it's a big database), you'd want a way to write just the parts of the state that have changed recently.

* Similarly, bringing recovering replicas up to date by sending them a complete snapshot will be slow, needlessly so if the replica already has an old snapshot.

* Servers may not be able to take much advantage of multi-core because operations must be executed one at a time (in log order).

These could be fixed by modifying Raft, but the result might have less value as a tutorial introduction.
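As a concrete illustration of the first bullet, here is a minimal sketch of batching ("group commit") in Go. It is not from the Raft paper; the Batcher type and the persist function are invented for illustration, and a real implementation would also bound batch sizes, handle errors, and tell callers when their entries are durable.

    package main

    import "sync"

    type LogEntry struct {
        Term    int
        Command interface{}
    }

    // Batcher coalesces many appended entries into one disk write.
    type Batcher struct {
        mu      sync.Mutex
        pending []LogEntry
        kick    chan struct{} // wakes the writer; capacity 1 so signals coalesce
    }

    func NewBatcher() *Batcher {
        b := &Batcher{kick: make(chan struct{}, 1)}
        go b.writer()
        return b
    }

    // Append queues an entry; it will be persisted together with whatever
    // else accumulates before the disk becomes free.
    func (b *Batcher) Append(e LogEntry) {
        b.mu.Lock()
        b.pending = append(b.pending, e)
        b.mu.Unlock()
        select {
        case b.kick <- struct{}{}: // wake the writer
        default: // writer is already signaled
        }
    }

    // writer runs in its own goroutine; each iteration drains everything
    // queued so far and persists it with a single write.
    func (b *Batcher) writer() {
        for range b.kick {
            b.mu.Lock()
            batch := b.pending
            b.pending = nil
            b.mu.Unlock()
            if len(batch) == 0 {
                continue // a signal raced with a previous drain
            }
            persist(batch) // hypothetical: one write()+fsync() for the whole batch
        }
    }

    func persist(batch []LogEntry) { /* write and fsync the entries */ }

Note that a leader using this pattern must still wait until an entry's batch is durable before counting that entry as persisted; the sketch omits that acknowledgment path.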
Q: Is Raft used in real-world software, or do companies generally roll their own flavor of Paxos (or use a different consensus protocol)?

A: Raft is pretty new, so there hasn't been much time for people to design new systems based on it. Most of the state-machine replication systems I hear about are based on the older Multi-Paxos and Viewstamped Replication protocols.

Q: What is Paxos? In what sense is Raft simpler?

A: There is a protocol called Paxos that allows a set of servers to agree on a single value. While Paxos requires some thought to understand, it is far simpler than Raft. Here's an easy-to-read paper about Paxos:

http://css.csail.mit.edu/6.824/2014/papers/paxos-simple.pdf

However, Paxos solves a much smaller problem than Raft. To build a real-world replicated service, the replicas need to agree on an indefinite sequence of values (the client commands), and they need ways to efficiently recover when servers crash and restart or miss messages. People have built such systems with Paxos as the starting point; look up Google's Chubby and Paxos Made Live papers, and ZooKeeper/ZAB. There is also a protocol called Viewstamped Replication; it's a good design, and similar to Raft, but the paper about it is hard to understand.

These real-world protocols are complex, and (before Raft) there was not a good introductory paper describing how they work. The Raft paper, in contrast, is relatively easy to read and fairly detailed. That's a big contribution.

Whether the Raft protocol is inherently easier to understand than something else is not clear. The issue is clouded by a lack of good descriptions of other real-world protocols. In addition, Raft sacrifices performance for clarity in a number of ways; that's fine for a tutorial but not always desirable in a real-world protocol.

Q: How long had Paxos existed before the authors created Raft? How widespread is Raft's usage in production now?

A: Paxos was invented in the late 1980s. Raft was developed around 2012. Raft closely resembles a protocol called Viewstamped Replication, originally published in 1988. There were replicated fault-tolerant file servers built on top of Viewstamped Replication in the early 1990s, though not in production use.

A bunch of real-world systems are derived from Paxos: Chubby, Spanner, Megastore, and ZooKeeper/ZAB. Starting in the early 2000s, big web sites and cloud providers needed fault-tolerant services, and Paxos was more or less re-discovered at that time and put into production use.

There are several real-world users of Raft: Docker (https://docs.docker.com/engine/swarm/raft/) and etcd (https://etcd.io). Other systems said to be using Raft include CockroachDB, RethinkDB, and TiKV. Maybe you can find more starting at http://raft.github.io/

Q: How does Raft's performance compare to Paxos in real-world applications?

A: The fastest Paxos-derived protocols are probably much faster than Raft; have a look at ZAB/ZooKeeper and Paxos Made Live. But I suspect one could modify Raft to have very good performance. etcd3 claims to have achieved better performance than ZooKeeper and many Paxos-based implementations (https://www.youtube.com/watch?v=hQigKX0MxPw).

Q: Why are we learning/implementing Raft instead of Paxos?

A: We're using Raft in 6.824 because there is a paper that clearly describes how to build a complete replicated service using Raft. I know of no satisfactory paper that describes how to build a complete replicated service based on Paxos.

Q: Are there systems like Raft that can survive and continue to operate when only a minority of the cluster is active?

A: Not with Raft's properties. But you can do it with different assumptions, or different client-visible semantics. The basic problem is split brain -- the possibility of more than one server acting as leader. There are two approaches that I know of.

If somehow clients and servers can learn exactly which servers are live and which are dead (as opposed to live but partitioned by a network failure), then one can build a system that can function as long as one server is alive, picking (say) the lowest-numbered server known to be alive. However, it's hard for one computer to decide if another computer is dead, as opposed to the network losing the messages between them. One way to do it is to have a human decide -- the human can inspect each server and decide which are alive and dead.

The other approach is to allow split-brain operation, and to have a way for servers to reconcile the resulting diverging state after partitions are healed. This can be made to work for some kinds of services, but has complex client-visible semantics (usually called "eventual consistency"). Have a look at the Bayou and Dynamo papers, which are assigned later in the course.

Q: In Raft, the service which is being replicated is not available to the clients during an election process. In practice how much of a problem does this cause?

A: The client-visible pause seems likely to be on the order of a tenth of a second. The authors expect failures (and thus elections) to be rare, since they only happen if machines or the network fail. Many servers and networks stay up continuously for months or even years at a time, so this doesn't seem like a huge problem for many applications.

Q: Are there other consensus systems that don't have leader-election pauses?

A: There are versions of Paxos-based replication that do not have a leader or elections, and thus don't suffer from pauses during elections. Instead, any server can effectively act as leader at any time. The cost of not having a leader is that more messages are required for each agreement.

Q: How are Raft and VMware FT related?

A: VM-FT can replicate any virtual machine guest, and thus any server-style software. Raft can only replicate software designed specifically to be replicated with Raft. For such software, Raft would likely be much more efficient than VM-FT.

Q: Why can't a malicious person take over a Raft server, or forge incorrect Raft messages?

A: Raft doesn't have any defense against attacks like this. It assumes that all participants are following the protocol, and that only the correct set of servers is participating. A real deployment would have to do better than this. The most straightforward option is to place the servers behind a firewall to filter out packets from random people on the Internet, and to ensure that all computers and people inside the firewall are trustworthy.

There may be situations where Raft has to operate on the same network as potential attackers. In that case a good plan would be to authenticate the Raft packets with some cryptographic scheme. For example, give each legitimate Raft server a public/private key pair, have it sign all the packets it sends, give each server a list of the public keys of legitimate Raft servers, and have the servers ignore packets that aren't signed by a key on that list. The sketch below shows the idea.
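Here is a minimal sketch of that signing scheme using Go's standard crypto/ed25519 package. The packet contents and framing are made up for illustration; a real deployment would also need key distribution and replay protection.

    package main

    import (
        "crypto/ed25519"
        "fmt"
    )

    func main() {
        // Each legitimate Raft server would hold a key pair like this one,
        // and every server would know the public keys of all legitimate peers.
        pub, priv, err := ed25519.GenerateKey(nil) // nil means use crypto/rand
        if err != nil {
            panic(err)
        }

        // The sender signs each packet it transmits. The bytes here are a
        // made-up stand-in for a serialized AppendEntries RPC.
        packet := []byte("AppendEntries{Term: 7, LeaderId: 2, ...}")
        sig := ed25519.Sign(priv, packet)

        // A receiver ignores any packet whose signature doesn't verify
        // against a public key on its list of legitimate servers.
        if ed25519.Verify(pub, packet, sig) {
            fmt.Println("packet accepted")
        } else {
            fmt.Println("packet ignored")
        }
    }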
Q: The paper mentions that Raft works under all non-Byzantine conditions. What are Byzantine conditions and why could they make Raft fail?

A: "Non-Byzantine conditions" means that the servers are fail-stop: they either follow the Raft protocol correctly, or they halt. For example, most power failures are non-Byzantine because they cause computers to simply stop executing instructions; if a power failure occurs, Raft may stop operating, but it won't send incorrect results to clients.

Byzantine failure refers to situations in which some computers execute incorrectly, because of bugs or because someone malicious is controlling the computers. If a failure like this occurs, Raft may send incorrect results to clients. Most of 6.824 is about tolerating non-Byzantine faults. Correct operation despite Byzantine faults is more difficult; we'll touch on this topic at the end of the term.

Q: In Figure 1, what does the interface between client and server look like?

A: Typically an RPC interface to the server. For a key/value storage server such as you'll build in Lab 3, it's Put(key,value) and Get(key) RPCs. The RPCs are handled by a key/value module in the server, which calls Raft.Start() to ask Raft to put a client RPC in the log, and reads the applyCh to learn of newly committed log entries.

Q: What if a client sends a request to a leader, the leader crashes before sending the client request to all followers, and the new leader doesn't have the request in its log? Won't that cause the client request to be lost?

A: Yes, the request may be lost. If a log entry isn't committed, Raft may not preserve it across a leader change. That's OK because the client could not have received a reply to its request if Raft didn't commit the request. The client will know (by seeing a timeout or leader change) that its request wasn't served, and will re-send it.

The fact that clients can re-send requests means that the system has to be on its guard against duplicate requests; you'll deal with this in Lab 3. One common scheme is sketched below.
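This is a minimal sketch of one common duplicate-detection scheme (the lab leaves the exact design to you, and these type and field names are invented for illustration): the client tags each request with a client ID and a sequence number, and the server remembers the highest sequence number it has executed for each client.

    package main

    // Op is the command the key/value module hands to Raft.Start();
    // ClientId and Seq let replicas recognize re-sent requests.
    type Op struct {
        ClientId int64
        Seq      int64
        Key      string
        Value    string
    }

    type KVServer struct {
        data    map[string]string
        lastSeq map[int64]int64 // highest Seq executed, per client
    }

    // apply is called once for each committed Op read from the applyCh.
    func (kv *KVServer) apply(op Op) {
        if op.Seq <= kv.lastSeq[op.ClientId] {
            return // a re-send of a request we already executed; ignore it
        }
        kv.lastSeq[op.ClientId] = op.Seq
        kv.data[op.Key] = op.Value
    }

The point of doing the check in apply(), on the committed log, is that every replica sees the same Ops in the same order, so every replica makes the same duplicate-or-not decision.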
Q: If there's a network partition, can Raft end up with two leaders and split brain?

A: No. There can be at most one active leader. A new leader can only be elected if it can contact a majority of servers (including itself) with RequestVote RPCs. So if there's a partition, and one of the partitions contains a majority of the servers, that one partition can elect a new leader. Other partitions must have only a minority, so they cannot elect a leader. If there is no majority partition, there will be no leader (until someone repairs the network partition).

Q: Suppose a new leader is elected while the network is partitioned, but the old leader is in a different partition. How will the old leader know to stop committing new entries?

A: The old leader will either not be able to get a majority of successful responses to its AppendEntries RPCs (if it's in a minority partition), or, if it can talk to a majority, that majority must overlap with the new leader's majority, and the servers in the overlap will tell the old leader that there's a higher term. That will cause the old leader to switch to follower.

Q: When some servers have failed, does "majority" refer to a majority of the live servers, or a majority of all servers (even the dead ones)?

A: Always a majority of all servers. So if there are 5 Raft peers in total, but two have failed, a candidate must still get 3 votes (including its own) in order to be elected leader.

There are many reasons for this. It could be that the two "failed" servers are actually up and running in a different partition. From their point of view, there are three failed servers. If they were allowed to elect a leader using just two votes (from just the two live-looking servers), we would get split brain. Another reason is that we need the majorities assembled by any two leaders to overlap in at least one server, to guarantee that a new leader sees the previous term number and any log entries committed in previous terms; this requires a majority out of all servers, dead and alive.

Q: What if the election timeout is too short? Will that cause Raft to malfunction?

A: A bad choice of election timeout does not affect safety, it only affects liveness. If the election timeout is too small, then followers may repeatedly time out before the leader has a chance to send out any AppendEntries. In that case Raft may spend all its time electing new leaders, and no time processing client requests. If the election timeout is too large, then there will be a needlessly long pause after a leader failure before a new leader is elected.

Q: Why randomize election timeouts?

A: To reduce the chance that two peers simultaneously become candidates and split the votes evenly between them, preventing anyone from being elected. Each peer picks a fresh random timeout from some fixed range whenever it resets its election timer, as in the sketch below.
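A minimal sketch in Go; the 300-600 millisecond range is only an example (the Raft paper suggests something like 150-300 ms), and the right range depends on network latency and how quickly you want to detect a dead leader.

    package main

    import (
        "math/rand"
        "time"
    )

    // electionTimeout returns a fresh random timeout in [300ms, 600ms).
    // Because each peer draws independently, two peers rarely time out
    // at the same moment, so split votes are rare.
    func electionTimeout() time.Duration {
        return 300*time.Millisecond + time.Duration(rand.Int63n(300))*time.Millisecond
    }

A follower would reset its timer to electionTimeout() each time it hears from the current leader, and start an election only when that timer expires.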
Q: Can a candidate declare itself the leader as soon as it receives votes from a majority, and not bother waiting for further RequestVote replies?

A: Yes -- a majority is sufficient. It would be a mistake to wait longer, because some peers might have failed and thus might never reply.

Q: Can a leader in Raft ever stop being a leader except by crashing?

A: Yes. If a leader's CPU is slow, or its network connection breaks, loses too many packets, or delivers packets too slowly, the other servers won't see its AppendEntries RPCs, and will start an election.

Q: When are followers' log entries sent to their state machines?

A: Only after the leader says that an entry is committed, using the leaderCommit field of the AppendEntries RPC. At that point the follower can execute (or apply) the log entry, which for us means sending it on the applyCh.

Q: Should the leader wait for replies to AppendEntries RPCs?

A: The leader should send the AppendEntries RPCs concurrently, without waiting. As replies come back, the leader should count them, and mark the log entry as committed only when it has replies from a majority of servers (including itself).

One way to do this in Go is for the leader to send each AppendEntries RPC in a separate goroutine, so that the leader sends the RPCs concurrently. Something like this:

    count := 1 // the leader itself has the entry
    var mu sync.Mutex
    for server := range peers {
        if server == me {
            continue
        }
        go func(server int) {
            // sendAppendEntries stands in for your RPC call; it sends
            // the RPC and waits for the reply.
            reply := sendAppendEntries(server, args)
            if reply.Success {
                mu.Lock()
                defer mu.Unlock()
                count++
                if count == len(peers)/2+1 {
                    // this entry is committed
                }
            }
        }(server) // pass server as an argument so each goroutine gets its own copy
    }

Q: What happens if more than half of the servers die?

A: The service can't make any progress; it will keep trying to elect a leader over and over. If/when enough servers come back to life with their persistent Raft state intact, they will be able to elect a leader and continue.

Q: Why is the Raft log 1-indexed?

A: You should view it as zero-indexed, but starting out with one entry (at index=0) that has term 0. That allows the very first AppendEntries RPC to contain 0 as PrevLogIndex, and be a valid index into the log. (See the short sketch at the end of this FAQ.)

Q: When a network partition happens, won't client requests in minority partitions be lost?

A: Yes, only the partition with a majority of servers can commit and execute client operations. The servers in the minority partition(s) won't be able to commit client operations, so they won't reply to client requests. Clients will keep re-sending the requests until they can contact a majority Raft partition.

Q: Is the argument in 5.4.3 a complete proof?

A: 5.4.3 is not a complete proof. Here are some places to look:

http://ramcloud.stanford.edu/~ongaro/thesis.pdf
http://verdi.uwplse.org/raft-proof.pdf
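Returning to the 1-indexed-log question above, here is a minimal sketch of the sentinel-entry idea in Go (the LogEntry type is invented for illustration; your lab code will have its own):

    package main

    type LogEntry struct {
        Term    int
        Command interface{}
    }

    // newLog starts the log with a placeholder entry at index 0, term 0.
    // The first real entry then lives at index 1, and the very first
    // AppendEntries RPC can carry PrevLogIndex=0 / PrevLogTerm=0 and
    // still refer to a valid log slot.
    func newLog() []LogEntry {
        return []LogEntry{{Term: 0}}
    }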