6.824 2020 Lecture 11: Chain Replication Chain replication for supporting high throughput and availability (OSDI 2004) by Renesse and Schneider Context: replication state machines Two main approach: 1. Run all operations through Paxos/Raft Relatively uncommon approach but see Spanner (and lab 3) Performance is challenging, especially if data is large 2. Configuration server plus primary/backup replication For example: GFS, master plus P/B replication for chunks For example: VM-FT, test-and-set server plus P/B replication Chain replication (CR) This paper about how to do P/B replication for approach 2 - covers fail-over Influential: many build on CR Ceph, Parameter Server, COPS, FAWN. What is Chain Replication (CR)? Goals: if client gets write reply, data is safe if even one server survives. and linearizable. S1, S2, S3, S4 S1 is "head" S4 is "tail" Writes: Client sends to head Forwarded down the chain, in order Each server overwrites old data with new data Tail responds to client Reads: Client sends to tail Tail responds (no other nodes involved) Configuration service's role The config service monitors the servers If a server is unreachable: config makes a new configuration and informs clients and servers no service during this time Config service avoids split brain Would it be correct (linearizable) to let clients read any replica in CR? No. A read could see uncommitted data, might disappear due to a failure. A client could see a new value from one replica, and then an older value from a different (later) replica. Thus, read only from the tail Lecture question: Give an example scenario where a client could observe incorrect results (i.e., non-linearizable) if the head of the chain would return a response to the client as soon as it received an acknowledgment from the next server in the chain (instead of the tail responding). Intuition for linearizability of CR? When no failures, almost as if the tail were the only server. Head picks an order for writes, replicas apply in that order, so they will stay in sync except for recent (uncommitted) writes. Tail exposes only committed writes to readers. Failure recovery, briefly. Good news: every replica knows of every committed write. But need to push partial writes down the chain. If head fails, successor takes over as head, no commited writes lost. If tail fails, predecessor takes over as tail, no writes lost. If intermediate fails, drop from chain, predecessor may need to re-send recent writes. Adding server Extend tail, briefly: copy tail's state to new server tail records updates while copy is in progress new server tells tail it isn't the tail anymore tail send all recorded updates to new server new server tells configuration server, it is tail new server starts serving requests old tail redirect client requests Why is CR attractive (vs Raft)? Client RPCs split between head and tail, vs Raft's leader handles both. Head sends each write just once, vs Raft's leader sends to all. Reads involve just one server, not all as in Raft. Or majority with Raft read optimization Situation after failure simpler than in Raft (remember Figure 7). [Consider some scenarios] Why is it attractive to let clients read any replica in CR? The opportunity is that the intermediate replicas may still have spare CPU cycles when the read load is high enough to saturate the tail. Moving read load from the tail to the intermediate nodes might thus yield higher read throughput on a saturated chain. The CR paper describes one way to achieve this goal: Split objects over many chains, each server participates in multiple chains. C1: S1 S2 S3 C2: S2 S3 S1 C3: S3 S1 S2 This works if load is more or less evenly divided among chains. It often isn't. Maybe you could divide objects into even more chains. Why can CR serve reads from replicas linearizably but Raft/ZooKeeper/&c cannot? Relies on being a chain, so that *all* nodes see each write before the write commits, so nodes know about all writes that might have committed, and thus know when to ask the tail. Raft/ZooKeeper can't do this because leader can proceed with a mere majority, so can commit without all followers seeing a write, so followers are not aware when they have missed a committed write. Does that mean CR is strictly more powerful than Raft &c? No. All CR replicas have to participate for any write to commit. If a node isn't reachable, CR must wait. So not immediately fault-tolerant in the way that ZK and Raft are. Equivalently, why can't 2nd node take over as head if it can't reach the head? Partition -- split brain -- the 2nd node must wait patiently. How can we safely make use of a replication system that can't handle partition? A single "configuration manager" must choose head, chain, tail. Everyone (servers, clients) must obey or stop. Regardless of who they locally think is alive/dead. A configuration manager is a common and useful pattern. It's the essence of how GFS (master) and VMware-FT (test-and-set server) work. Usually Paxos/Raft/ZK for config service, data sharded over many replica groups, CR or something else fast for each replica group. Lab 4 works this way (though Raft for everything).