6.824 2013 Lecture 3: Primary/Backup Replication today state machine replication primary / backup replication explain lab 2 ideas discuss paper today's goals: availablity: still useable despite [some class of] failures correctness: act just like a single server to clients and handle network failures and repair approach: state machine replication replicate state (e.g., key/value store) apply operations in the same order view each replica has a state machine each replica starts in the same state applies all operations in the same order -> they end up in the same state assumption: operations are deterministic what is operation needs the time? if two replicas call gettime() they may get different answers many variants of state machine replication today: primary/backup approach basic primary/backup idea multiple clients, sending requests concurrently two server roles: primary, backup which server is primary may change due to failures clients send operations to primary (e.g. Put, PutHash, Get) primary chooses the order primary forwards operations (or results) to backup primary replies to client many challenges: non-deterministic operations network partition state transfer between primary and backup new backups after backup becomes primary two primary-back up systems today Hypervisor-based fault tolerance deals with non-determinism but doesn't tolerate network partitions Lab 2 deals with network partition CASE STUDY: HYPERVISOR-BASED FAULT-TOLERANCE Bressoud and Schneider (SOSP 1995) Detailed description of a primary/backup system Very general -- not application specific like the labs Motivation Goal: fault tolerance / availability for any existing s/w by running same instructions &c on two computers Transparent: any app, any O/S Would be magic if it worked well! Failure model: computer fail due to independent hw faults Plan 1: [simple diagram] Two machines Identical start state: registers, program, memory contents Maybe they will stay identical If one fails, the other just keeps going, no time/data/work is lost What will go wrong with Plan 1? external outputs will be duplicated must send external inputs to both inputs must arrive at the same time at both interrupts must arrive at same time at both CPUs unlikely to run at exactly the same speeds The basic plan: Ignore backup's I/O instructions Hide interrupts from backup Primary forwards I/O results and interrupts to backup Inject those interrupts into backup Q: do we have to be exact about timing of interrupts? can backup just fake an interrupt when msg arrives from primary? How are they able to control I/O and interrupt timing? they slip a virtual machine / hypervisor under the O/S What is a hypervisor? a piece of software emulates a real machine precisely so an O/S &c can run on a hypervisor "guest" as if hypervisor simulated each instruction and kept simulated registers, memory, &c BUT as much as possible runs guest on real h/w! can be much faster than interpreting instructions Many fascinating details, see VMware papers for x86 story Hypervisor gives us control Suppress output from backup, detect interrupts/input at primary, &c Still need a scheme for delivering I/O inputs and interrupts From primary to backup, at the right times But: hard to know when interrupt happened at primary, repeat at backup Answer: HP PA-RISC recovery counter It forces an interrupt every N instructions This allows primary and backup hypervisor to get control at exactly the same point in the code Epochs Primary alternates epoch and end of epoch; backup does too Every epoch is same # of instructions, e.g. 4000 instructions Primary during epoch E: Deliver interrupts buffered during E-1 Interrupts deliver input, and acknowlege output Then continue executing I/O instructions start input/output (but not done until interrupt) Hypervisor hides new interrupts, just buffer, w/ I/O input Primary at end of epoch E: Send buffered interrupt info, including I/O input, to backup Wait for backup to ACK Send "done with E" to backup Backup during epoch E: Deliver interrupts, and I/O input, primary told us of during E-1 Then continue executing I/O instructions do nothing Ignore interrupts from backup devices Backup at end of epoch E: Receive and buffer interrupt/input msgs from primary Wait for "done with E+1" from primary Start epoch E+1 Note backup lags primary by one epoch Backup doesn't start 7 until primary is done with 7 What if primary fails during epoch E? Backup finishes with E-1 Backup times out waiting for "done with E" Can backup just switch to being primary for epoch E? No: primary might have crashed in the middle of writing device registers May not be correct for backup to repeat I/O register writes Example: Maybe for a disk read the driver is required to write disk controller registers in a specific order, e.g. track #, sector #, length, DMA address Primary did *some* of those writes before crashing Backup might then write track # when h/w expecting sector # Good news: many h/w devices can be reset to known state And I/O operations restarted Existing drivers already know how to handle "please reset me" interrupts At any rate, true for HP-UX disk driver Failover strategy: Backup times out waiting for "done with E" Backup executes epoch E as backup (no output, I/O suppressed) Switches to primary for E+1 Hypervisor generates "uncertain interrupts" for I/O started <= E Including suppressed I/O during failover epoch E During E+1: Drivers respond to uncert interrupts w/ reset, repeat of waiting I/O ops (all this is based on footnote 5 on page 4, and IO1 and P8 on page 5) Potential problem: repeated output Maybe primary crashed just as it finished E And it completed some output, e.g. disk write or network pkt send Hypervisor will generate uncertain interrupt for those I/O ops Backup will repeat them in E+1 Will the outside world tolerate repeated output? In general, maybe not E.g. controlling an instrument in a scientific experiment Network protocols: No problem, Must handle duplicated packets anyway Shared disk: No problem, repeated write of same data Do O/S drivers on primary talk directly to h/w? I suspect O/S talks to device simulated/mediated by hypervisor Hypervisor needs to be able to collect input read from devices and sent it to backup So hypervisor must be intervening (and understand) access to dev h/w Also hypervisor must initialize dev h/w when backup takes over Figure 1 shows a shared disk Rather than separate disk per machine with identical content Only the primary writes the disk Why does shared disk make the design easier? Disk writes are non deterministic one disk may have bad sector and write fails so (unlike RAM) backup's disk won't naturally be identical to primary's Simplifies recovery of failed machine Don't need to copy whole disk from survivor Won't disk failures ruin the fault-tolerance story? Can be fixed separately, w/ RAID And my memory of 1990 is that disks failed much less often than computers crashed Today's question: What if system had private disks instead of one dual-ported disk? Would that make the system more fault-tolerant? should backup suppress disk I/O instructions? should backup suppress disk interrupts? should backup directly read its own disk? or should primary forward read data? What if ethernet cable breaks? primary still running backup will try to promote itself that's part of what the fail-stop assumption assumes away... fail-stop is reasonable for a tightly coupled system they could probably make a 10-foot ethernet pretty reliable or have two separate ethernets What if we want to re-start a failed+repaired primary? What if the computers have multiple cores + threads? Would this scheme work w/o modification? What we can expect for performance? When will performance be bad? Frequent interrupts -- may be delayed by epoch scheme Lots of input -- must be sent over ethernet Many privileged instructions -- many hypervisor traps Should epochs be short or long? Short means many end-of-epoch pauses + chatter Long means less overhead But I/O interrupts delayed longer What performance do they see? CPU-bound: Figure 2 Disk-bound (really seek-bound): Figure 3 Why do writes do better than reads? Why much better relative performance for disk-bound than CPU-bound? Why does Figure 3 write performance level off? Why isn't it heading down towards 1.0, as in Figure 2? What is the limiting factor in performance? 442-microsecond epoch overhead 442 us is 22100 instructions (50 mHz) So 22000-instruction epochs -> 2x slowdown for CPU-bound Plus time to emulate privileged instructions Is the 442 microseconds CPU time, or network time? Figure 4 ATM experiment suggests CPU time, but not clear Does anyone use these ideas today? Five years ago I said "no" -- too slow Instead, specialized replicated storage systems Like my original client/server diagram, for put/get &c But now: yes! VMware has a fault-tolerant VM system Same basic idea, but more complete and sophisticated no epochs primary has no restrictions on where interrupts occur backup can cause them to occur at same place primary holds each output until backup ACKs to ensure backup will produce same output if primary fails but primary can continue executing while it waits fault-tolerant network disks copes with partition by test-and-set on disk at most one of primary and backup will win no progress if network disk not reachable automatic creation of new backup after failure on some spare VM host, don't need to repair same hardware much faster: only 10% slow-down, not paper's 2X PRIMARY-BACKUP REPLICATION IN LAB 2 what wider classes of failure would we like to handle? temporary or permanent loss of connectivity network partitions == can't know if a server is crashed or just not reachable what is the risk of having network failure between primary/backup clients, primary, backup might not agree abt who is primary or about whether backup is alive (and thus if it is up to date) result: two primaries e.g. some clients switch to backup, others don't or one client switches back and forth they don't see each other's updates! "split brain" does *not* achieve goal of acting like single server lab 2 goals: tolerate network problems, including partition either keep going, correctly or suspend operations until network is repaired replacement of failed servers lab 2 overview: agreement: "view server" decides who p and b are clients and servers ask view server they don't make independent decisions only one vs, avoids multiple machines independently deciding who is p repair: view server can co-opt "idle" server as b after old b becomes p primary initializes new backup's state the tricky part: 1. only one primary! 2. primary must have state! we will work out some rules to ensure these view server maintains a sequence of "views" view #, primary, backup 1: S1 S2 2: S2 -- 3: S2 S3 monitors server liveness each server periodically sends a Ping RPC "dead" if missed N Pings in a row "live" after single Ping can be more than two servers Pinging view server if more than two, "idle" servers if primary is dead new view with previous backup as primary if backup is dead, or no backup new view with previously idle server as backup OK to have a view with just a primary, and no backup but -- if an idle server is available, make it the backup how to ensure new primary has up-to-date replica of state? only promote previous backup i.e. don't make an idle server the primary but what if the backup hasn't had time to acquire the state? how to avoid promoting a state-less backup? example: 1: S1 S2 S1 stops pinging viewserver 2: S2 S3 S2 *immediately* stops pinging 3: S3 -- potential mistake: maybe S3 never got state from S2 better to stay with 2/S2/S3, maybe S2 will revive how can viewserver know it's OK to change views? lab 2 answer: primary in each view must acknowledge that view to viewserver viewserver must stay with current view until acknowledged even if the primary seems to have failed no point in proceeding since not acked == backup may not be initialized Example from lab writeup [http://css.csail.mit.edu/6.824/2014/labs/lab-2.html] Q: can more than one server think it is primary? 1: S1, S2 net broken, so viewserver thinks S1 dead but it's alive 2: S2, -- now S1 alive and not aware of view #2, so S1 still thinks it is primary AND S2 alive and thinks it is primary => split brain, no good how to ensure only one server acts as primary? even though more than one may *think* it is primary "acts as" == executes and responds to client requests the basic idea: 1: S1 S2 2: S2 -- S1 still thinks it is primary S1 must forward ops to S2 S2 thinks S2 is primary so S2 can reject S1's forwarded ops the rules: 1. primary in view i must have been primary or backup in view i-1 2. primary must wait for backup to accept each request (incl. get) 3. primary must wait responding to client until backup acked it execute request 4. non-backup must reject forwarded requests 5. non-primary must reject direct client requests 6. every operation must be before or after state transfer (see below) example: 1: S1, S2 viewserver stops hearing Pings from S1 2: S2, -- it may be a while before S2 hears about view #2 before S2 hears about view #2 S1 can process ops from clients, S2 will accept forwarded requests S2 will reject ops from clients who have heard about view #2 after S2 hears if S1 receives client request, it will forward, S2 will reject so S1 can no longer act as primary S1 will send error to client, client will ask vs for new view, client will re-send to S2 the true moment of switch-over occurs when S2 hears about view #2 how can new backup get state? e.g. the key/value state if S2 is backup in view i, but was not in view i-1, S2 should ask primary to transfer the complete state rule for state transfer: every operation (Put/Get) must be either before or after state xfer if before, xferred state must reflect operation if after, primary must forward operation after xfer finishes Q: does primary need to forward Get()s to backup? after all, Get() doesn't change anything, so why does backup need to know? and the extra RPC costs time A: Remember Get/Put/PutHash must provide exactly once semantics. C1 sends Get to P, but not to B C1 doesn't receive response C2 sends Put, which is processed P and B P fails B takes over C1 resends Get, and observes Put, even though it shoudn't Q: how could we make primary-only Get()s work? Q: are there cases when the lab 2 protocol cannot make forward progress? Yes. Many. View service fails Primary fails before it acknowledges view Backup fails before it gets new state .... We will start fixing those in lab 3