6.824 FAQ for Chain Replication for Supporting High Throughput and Availability (OSDI 2004), by van Renesse and Schneider

Q: Is chain replication used in practice over other things like Raft or Paxos?

A: Systems often use both. A common way of building distributed systems is to use a configuration server (called the master in the paper) to maintain configuration information (e.g., who is the primary?) and a replication system to replicate the data. Paxos/Raft are commonly used to build the configuration server, while the replication system often uses primary/backup or chain replication. The reason to use Raft/Paxos for the configuration server is that it must handle split brain. The reason to use primary/backup for data replication is that it is simpler than Raft/Paxos; Raft, for example, is not well suited to replicating large amounts of data. The replication system can rely on the configuration server to avoid split brain.

Q: How does CR cope with network partition and prevent split brain?

A: At a high level, a chain will pause operation if one of its servers or network links fails, and wait for the configuration server to notice the problem and reconfigure the chain. Let's consider separately the situation before the configuration server notices a problem, and after it notices.

Before the configuration server notices (or if it doesn't notice any problem), if the partition prevents communication between successive chain servers, updates will stop completing because they can no longer travel down the chain all the way from head to tail. Read queries will continue to work for clients that can talk to the tail. The system is safe (linearizable), but not very live, since updates can't complete.

At some point the configuration server may see that it can't communicate with one or more of the chain's servers, and will consider those servers to have failed (though they may actually be alive). If the configuration server can't talk to any of the chain's servers, it will do nothing, and the existing chain may continue to provide correct service (perhaps without completing updates) to the clients that can talk to it.

If the configuration server thinks just the head is dead, it will direct clients to send updates to the second server in the chain, and tell that server that it is now the head. But perhaps the old head is not dead, and is merely partitioned from the configuration server. In that case the paper does not explain how to avoid split brain: there may now be two servers operating as head and forwarding conflicting updates to the next server in the chain. You can imagine solutions -- for example, the new head (the old second server in the chain) could reject updates sent to it by the old head.

If the configuration server thinks the tail is dead, it will tell clients to send read queries to the N-1'th server in the chain, and tell that server that it is now the tail. But perhaps the old tail is not dead, and is merely partitioned from the configuration server. In that case, some clients may still send read queries to the old tail, which will now return stale values because it is no longer receiving updates from the chain. Again, this is split brain, and the paper doesn't explain how to avoid it. A possible solution is for the configuration server to grant a lease to the tail, and to delay designating any new tail until the previous tail's lease has expired.
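Here is a rough Go sketch of that lease idea. It is my own illustration, not something from the paper; names like Tail, RenewLease, and Read are invented.

  package lease

  import (
      "errors"
      "sync"
      "time"
  )

  // Tail only answers read queries while it holds a valid lease from
  // the configuration server. A partitioned old tail stops answering
  // once its lease expires, so it cannot return stale data to clients.
  type Tail struct {
      mu          sync.Mutex
      data        map[string]string
      leaseExpiry time.Time // set when the configuration server grants or renews the lease
  }

  // RenewLease records a lease of duration d granted by the configuration server.
  func (t *Tail) RenewLease(d time.Duration) {
      t.mu.Lock()
      defer t.mu.Unlock()
      t.leaseExpiry = time.Now().Add(d)
  }

  // Read serves a client query only while the lease is still valid.
  func (t *Tail) Read(key string) (string, error) {
      t.mu.Lock()
      defer t.mu.Unlock()
      if time.Now().After(t.leaseExpiry) {
          return "", errors.New("lease expired: no longer the tail")
      }
      return t.data[key], nil
  }

The configuration server, for its part, must wait until the old tail's lease has expired before telling the N-1'th server that it is the new tail.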
Q: What are the tradeoffs of Chain Replication vs Raft or Paxos?

A: Both CR and Raft/Paxos are replicated state machines. They can be used to replicate any service that fits the state-machine mold (basically, one that processes a stream of requests one at a time). One application for Raft/Paxos is object storage -- you'll build object storage on top of Raft in Lab 3. Similarly, the underlying machinery of CR could be used for services other than storage, for example to implement a lock server.

CR is likely to be faster than Raft because the CR head does less work than the Raft leader: the CR head sends writes to just one replica, while the Raft leader must send all operations to all followers. CR has a performance advantage for reads as well, since it serves them from the tail (not the head), while the Raft leader must serve all client requests.

However, Raft/Paxos and CR differ significantly in their failure properties. If there are N replicas, a CR system can recover as long as even a single replica survives. Raft and Paxos require a majority of the N replicas to be available in order to operate, so in that sense they are less tolerant of failure. But a CR chain has to pause updates if there's even a single failure, and must wait for the configuration server to notice and reconfigure the chain; in the paper's setup this takes ten seconds. Raft/Paxos, in contrast, can continue operating without interruption as long as a majority is available, so they handle a slow, flaky, or briefly unavailable replica more smoothly than CR.

Q: Would Chain Replication be significantly faster or slower than the kind of primary/backup used in GFS?

A: If there are just two replicas, there's probably not much difference. Perhaps CR would be faster for writes, since the tail can send responses directly to the client; in a classic primary/backup scheme, the primary has to wait for the backup to acknowledge a write before the primary responds to the client.

If there are three or more replicas, the primary in a classic primary/backup system has to send each write to every replica. If the write data is big, these network sends can put a significant load on the primary. Chain Replication spreads this networking load over all the replicas, so CR's head might be less of a performance bottleneck than a classic primary. On the other hand, the client-observed latency for writes might be higher in CR.

Q: Section 5.2 evaluates multiple chains. What does this mean?

A: In Chain Replication, only the head and tail directly serve client requests; the other replicas contribute to fault tolerance but not to performance. Since the load on the head and tail is thus likely to be higher than the load on intermediate nodes, you could get into a situation where performance is bottlenecked by the head and tail while plenty of CPU sits idle on the intermediate nodes. Section 5.2 uses CR in a way that avoids this limitation. A data center will probably have lots of distinct CR chains, each serving a fraction (shard) of the objects. Suppose you have three servers (S1, S2, and S3) and three chains (C1, C2, and C3). Then you can have the three chains be:

  C1: S1 S2 S3
  C2: S2 S3 S1
  C3: S3 S1 S2

Now, assuming activity on the three chains is roughly equal, the load on the three servers will also be roughly equal. In particular, the load of serving client requests (head and tail) will be divided roughly equally among the three servers.
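Here is a small Go sketch of that rotated placement (my own illustration; rotatedChains is an invented name, not from the paper). Chain i is served by servers i, i+1, ..., wrapping around, so every server is the head of one chain and the tail of another.

  package placement

  // rotatedChains returns, for each of n chains, the ordered list of
  // server indices forming that chain; position 0 is the head and
  // position n-1 is the tail.
  func rotatedChains(n int) [][]int {
      chains := make([][]int, n)
      for i := 0; i < n; i++ {
          chain := make([]int, n)
          for j := 0; j < n; j++ {
              chain[j] = (i + j) % n
          }
          chains[i] = chain
      }
      return chains
  }

With n = 3 this yields [0 1 2], [1 2 0], and [2 0 1], which are C1, C2, and C3 above with the servers numbered from zero. A real data center would have many more chains than servers (one chain per shard), but the idea of rotating head/tail roles to balance load is the same.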
Q: Section 3 says servers are assumed to be fail-stop. Are servers typically fail-stop? How does that work?

A: Servers, networks, disks, &c are not actually fail-stop. CPU hardware sometimes produces incorrect answers, disks and networks sometimes corrupt data, software sometimes has bugs that cause it to malfunction without warning, and human operators sometimes misconfigure systems. Server hardware is designed to be fail-stop for some errors (checksums catch most corrupted network packets, ECC catches most RAM errors, &c), but not all errors. What the paper means by "we assume servers to be fail-stop" is "if all failures are fail-stop, then the claims we make in this paper will hold; if you encounter a non-fail-stop failure, then the claims in the paper may not hold."

Most of the systems we'll look at assume fail-stop failures, and may silently malfunction if there are non-fail-stop ("Byzantine") failures. But there are designs that behave well in many non-fail-stop situations. First, systems derived from a paper titled Practical Byzantine Fault Tolerance (PBFT) by Castro and Liskov; PBFT is like Raft, but the servers check each other's actions with extra rounds of cryptographically authenticated communication. Second, systems in which clients can check the correctness of results that servers return, typically by use of cryptographic hashes or signatures. This can be tricky because clients need to defend against a server that returns data whose signature or hash is correct, but is not the latest value. Systems like this include SUNDR and Bitcoin.

Q: Is Chain Replication used by other systems?

A: Some examples: Amazon's EBS, Ceph's RADOS, Google's Parameter Server, COPS, and FAWN.

Q: I'm getting confused by all the symbols (particularly the circle with a plus sign inside) and the invariants given.

A: The confusion perhaps arises because the XOR symbol doesn't mean XOR on boolean values. It is more like a union, but not set union, because the left and right operands are not sets but sequences. My guess is that the analogy with the boolean XOR symbol is that a CR operation can appear only on the left or only on the right side of the XOR on sequences (but not both).

Q: This scheme would be bad if we care about latency, right?

A: Yes and no. Yes, the latency of update operations is proportional to the length of the chain. No, the latency of read operations is low: only the tail is involved.

Q: The paper reports that chain replication has the highest MTBU when using the rndpar volume placement strategy. Are there any competing advantages offered by ring, or is rndpar objectively better?

A: Figure 7 suggests that for a small number of servers, ring has a slight advantage over rndpar. The reason is that, for a chain length of N, if N random servers fail, it is more likely that some chain uses exactly those N servers with rndpar than with ring. That matters because such a chain cannot be repaired. With more servers, rndpar has the advantage because it spreads the work of repair over more servers, so repair is more likely to complete before the next failure. If failures occur faster than repairs, that increases the chance that all of a chain's servers will fail before any can be repaired.

Q: For the failure of other servers as described on page 5, what happens if the ack from the tail gets lost somewhere along the chain, and r never gets deleted from sent_i? Updates are not idempotent, so it can't just be resent down the chain, correct?

A: Updates are pushed down the chain one server at a time, so only the next-to-last server is waiting for the tail's ack, and it will keep retrying until it gets the ack. All updates are performed in the order the head issued them, so it is fine to apply the same update more than once.
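Here is a rough Go sketch of that retry logic (invented for illustration; none of these names come from the paper). Each update carries a sequence number assigned by the head, a server remembers forwarded-but-unacknowledged updates in a sent list, and a receiver ignores sequence numbers it has already applied, so retransmitting an update is harmless.

  package resend

  type Update struct {
      Seq   uint64 // assigned by the head; defines the order of all updates
      Key   string
      Value string
  }

  type Server struct {
      applied uint64            // highest sequence number applied so far
      store   map[string]string // the replicated object data
      sent    []Update          // forwarded to the successor but not yet acked by the tail
  }

  func NewServer() *Server {
      return &Server{store: make(map[string]string)}
  }

  // Receive applies an update from the predecessor; the link is assumed
  // to deliver updates in order, and duplicates caused by retransmission
  // are recognized by their sequence number and ignored.
  func (s *Server) Receive(u Update) {
      if u.Seq <= s.applied {
          return // already applied; this is a retransmission
      }
      s.store[u.Key] = u.Value
      s.applied = u.Seq
      s.sent = append(s.sent, u) // keep until the tail acknowledges it
  }

  // Ack discards updates that the tail has acknowledged; anything still
  // in sent may safely be resent to the successor.
  func (s *Server) Ack(seq uint64) {
      for len(s.sent) > 0 && s.sent[0].Seq <= seq {
          s.sent = s.sent[1:]
      }
  }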
Q: In real systems, how important is the strong consistency guarantee? Does it cause people to use this system over GFS-like systems? If not, why do people prefer other systems?

A: Programming with weak consistency can be difficult. I have seen few, if any, studies that quantify this, but there is anecdotal evidence that it matters; for example, when Amazon changed to strong consistency for S3, Dropbox (a user of S3) was able to simplify their code significantly. Similarly, Google has over the years been offering services with stronger consistency for its programmers; for example, Spanner provides stronger consistency than GFS, partially to make programmers' lives easier.

Q: To reduce latency, why not have an arrangement where we have the head, t-2 concurrent middle servers, and a tail?

A: You could, but it complicates the design. Recovery becomes more complicated: who should take over if the head fails? New logic is required to ensure that the most up-to-date of the t-2 middle servers is the one chosen as the new head. In unmodified CR it is clear: the next server in the chain.

Q: What's a "reliable FIFO link"?

A: By "reliable FIFO link" the paper means "TCP over an ordinary network". I suspect also that each update includes the sequence number assigned to it by the head, to make it easier to ensure that the updates are sent and processed in order. This "reliable FIFO link" may fail if the network stops delivering packets.
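To make the sequence-number guess concrete, here is a small Go sketch (my own illustration, not the paper's protocol) of a receiver that uses head-assigned sequence numbers to process updates in order even if the transport reorders them: it applies only the next expected sequence number and buffers anything that arrives early.

  package fifolink

  type Update struct {
      Seq   uint64 // assigned by the head
      Key   string
      Value string
  }

  type Receiver struct {
      next    uint64            // the next sequence number to apply
      pending map[uint64]Update // updates that arrived out of order
      store   map[string]string
  }

  func NewReceiver() *Receiver {
      return &Receiver{
          next:    1,
          pending: make(map[uint64]Update),
          store:   make(map[string]string),
      }
  }

  // Deliver accepts an update from the network, drops duplicates, and
  // applies buffered updates in strict sequence-number order.
  func (r *Receiver) Deliver(u Update) {
      if u.Seq < r.next {
          return // already applied
      }
      r.pending[u.Seq] = u
      for {
          nxt, ok := r.pending[r.next]
          if !ok {
              return
          }
          r.store[nxt.Key] = nxt.Value
          delete(r.pending, r.next)
          r.next++
      }
  }

With a single TCP connection underneath, updates already arrive in order; the sequence numbers mainly help across retransmissions and reconnections.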