6.824 2015 Lecture 3: Primary/Backup Replication

Today
  Replication
  Remus case study
  Lab 2 introduction

Fault tolerance
  we'd like a service that continues despite failures!
  available: still usable despite [some class of] failures
  correct: acts just like a single server to clients
  very hard!
  very useful!

Need a failure model: what will we try to cope with?
  Independent fail-stop computer failure
    Remus further assumes only one failure at a time
  Site-wide power failure (and eventual reboot)
  (Network partition)
  No bugs, no malice

Core idea: replication
  *Two* servers (or more)
  Each replica keeps the state needed for the service
  If one replica fails, the others can continue

Example: fault-tolerant MapReduce master
  lab 1 workers are already fault-tolerant, but the master is not
    the master is a "single point of failure"
  can we have two masters, in case one fails?
  [diagram: M1, M2, workers]
  state:
    worker list
    which jobs are done
    which workers are idle
    TCP connection state
    program counter

Big Questions:
  What state to replicate?
  How does the replica get state?
  When to cut over to the backup?
  Are anomalies visible at cut-over?
  How to repair / re-integrate?

Two main approaches:
  State transfer
    "Primary" replica executes the service
    Primary sends [new] state to the backups
  Replicated state machine
    All replicas execute all operations
    If same start state, same operations, same order, deterministic,
      then same end state

State transfer is simpler
  But the state may be large, slow to transfer
  Remus uses state transfer

Replicated state machine can be more efficient
  If operations are small compared to the data
  But complex, e.g. operation order on a multi-core, determinism
  Labs use replicated state machines
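To make the replicated-state-machine approach concrete, here is a minimal Go
sketch (not lab code): two replicas apply the same deterministic operations in
the same order to a key/value map, so they end in the same state. The Op and
Replica types and the Apply method are invented for this illustration.

  package main

  import "fmt"

  // Op is one deterministic operation; every replica sees the same
  // sequence of Ops in the same order.
  type Op struct {
      Kind  string // "Put", "Append", or "Get"
      Key   string
      Value string
  }

  // Replica is a state machine whose state is just a key/value map.
  type Replica struct {
      data map[string]string
  }

  // Apply executes one operation; because Apply is deterministic, two
  // replicas that apply the same sequence end with the same data.
  func (r *Replica) Apply(op Op) string {
      switch op.Kind {
      case "Put":
          r.data[op.Key] = op.Value
      case "Append":
          r.data[op.Key] += op.Value
      }
      return r.data[op.Key] // for "Get", just read the current value
  }

  func main() {
      ops := []Op{{"Put", "x", "a"}, {"Append", "x", "b"}, {"Get", "x", ""}}
      primary := &Replica{data: map[string]string{}}
      backup := &Replica{data: map[string]string{}}
      for _, op := range ops {
          primary.Apply(op)
          backup.Apply(op) // same ops, same order => same state
      }
      fmt.Println(primary.data["x"] == backup.data["x"]) // true
  }

The hard part in a real system is delivering the same operations in the same
order to every replica despite failures; that is what Lab 2's primary/backup
protocol and the later labs work on.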
Remus: High Availability via Asynchronous Virtual Machine Replication
NSDI 2008

Very ambitious system:
  Whole-system replication
  Completely transparent to applications and clients
  High availability for any existing software
  Would be magic if it worked well!

Failure model:
  1. independent hardware faults
  2. site-wide power failure

Plan 1 (slow, broken):
  [diagram: app, O/S, Remus underneath]
  two machines, primary and backup; plus net and other machines
  primary runs o/s and application s/w, talks to clients, &c
  backup does *not* initially execute the o/s, applications, &c
    it only executes some Remus code
  a few times per second:
    pause the primary
    copy entire RAM, registers, disk to the backup
    resume the primary
  if the primary fails:
    start the backup executing!

Q: is Plan 1 correct?
   i.e. does it look just like a single reliable server?

Q: what will the outside world see if the primary fails and the replica takes over?
   will the backup have the same state as was last visible on the primary?
   might a client request be lost? executed twice?

Q: is Plan 1 efficient?

Can we eliminate the fact that the backup's *state* trails the primary?
  Seems very hard!
  The primary would have to tell the backup (and wait) on every instruction.

Can we *conceal* the fact that the backup's state lags the primary?
  Prevent the outside world from *seeing* that the backup is behind the last primary state
    e.g. the primary sent an RPC reply, but the backup's state doesn't reflect that RPC
    e.g. the MR Register RPC, which it would be bad for the backup to forget

Idea: the primary "holds" output until the backup's state catches up to the output point
  e.g. the primary receives an RPC request, processes it, and creates the reply packet,
    but Remus holds the reply packet until the backup has received the
    corresponding state update

Remus epochs, checkpoints
  Clients:     C1 req1                  reply1
  Primary: ... E1 ... | pause |  E2    release | pause | ...
                        ckpt      ok     ckpt
  Backup:  ... (E0) ...         | apply |  (E1)

  1. Primary runs for a while in Epoch 1, holding E1's output
  2. Primary pauses
  3. Primary sends RAM+disk changes to the backup (in the background)
  4. Primary resumes execution in E2, holding E2's output
  5. Backup copies all of it to separate RAM, then ACKs
  6. Primary releases E1's output
  7. Backup applies E1's changes to its RAM and disk

If the primary fails, the backup finishes applying the last epoch's disk+RAM,
  then starts executing
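A toy sketch of the epoch cycle, in Go rather than inside a hypervisor: it
copies the whole state (rather than tracking dirty pages as Remus does) and
runs single-threaded, but it shows the key ordering rule -- output generated
during an epoch is released to clients only after the backup holds the
corresponding checkpoint. The names (Machine, copyState, held) are invented
for this sketch.

  package main

  import "fmt"

  // Machine stands in for a whole VM: its RAM + disk is just a map here.
  type Machine struct {
      state map[string]string
  }

  func copyState(s map[string]string) map[string]string {
      c := make(map[string]string)
      for k, v := range s {
          c[k] = v
      }
      return c
  }

  func main() {
      primary := &Machine{state: map[string]string{}}
      backup := &Machine{state: map[string]string{}}
      var held []string // replies generated this epoch, not yet released

      // Epoch E1: execute client requests, buffering their replies.
      primary.state["x"] = "1"
      held = append(held, "reply: Put(x,1) done")

      // Pause, checkpoint, send to backup; backup buffers, applies, ACKs.
      ckpt := copyState(primary.state)
      backup.state = ckpt

      // Only after the ACK is E1's held output released to clients.
      for _, r := range held {
          fmt.Println("release to client:", r)
      }
      held = nil

      // If the primary now fails, the backup already reflects every state
      // change whose output any client could have seen.
      fmt.Println("backup x =", backup.state["x"])
  }

In real Remus the primary resumes the next epoch speculatively while the
checkpoint is still in flight, which is why only the output has to wait, not
execution.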
Q: any externally visible anomalies? lost input/output? repeated output?

Q: what if the primary receives and executes a request, then crashes before the checkpoint?
   the backup won't have seen the request!

Q: what if the primary crashes after sending the ckpt to the backup, but before releasing the output?

Q: what if the client doesn't use TCP -- doesn't re-transmit?

Q: what if the primary fails while sending state to the backup?
   i.e. the backup is mid-way through absorbing the new state?

Q: are there situations in which Remus will incorrectly activate the backup?
   i.e. the primary is actually alive
   network partition...

Q: when the primary recovers, how does Remus restore replication?
   needed, since eventually the now-active ex-backup will itself fail

Q: what if *both* fail, e.g. site-wide power failure?
   RAM content will be lost, but the disks will probably survive
   after power is restored, reboot the guest from one of the disks
     O/S and application recovery code will execute
   the disk must be "crash-consistent"
     so probably not the backup's disk, if it was in the middle of installing a checkpoint
   the disk shouldn't reflect any held outputs (... why not?)
     so probably not the primary's disk, if it was executing
   I do not understand this part of the paper (Section 2.5)
   there seems to be a window during which neither disk could be used if power failed:
     the primary writes its disk during the epoch
     meanwhile the backup applies the last epoch's writes to its disk

Q: in what situations will Remus likely have good performance?

Q: in what situations will Remus likely have low performance?

Q: should epochs be short or long?

Remus Evaluation
  summary: 1/2 to 1/4 of native speed
  checkpoints are big and take time to send
  the output hold limits the speed at which clients can interact

Why so slow?
  checkpoints are big and take time to generate and send
    100ms for SPECweb2005 -- because many pages are written
  so inter-checkpoint intervals must be long
  so output must be held for quite a while
  so client interactions are slow
    only 10 RPCs per second per client

How could one get better performance for replication?
  big savings are possible with application-specific schemes:
    send just the state the application really needs, not all state
    send state in an optimized format, not whole pages
    send operations if they are smaller than the state
  likely *not* transparent to applications
    and probably not to clients either

PRIMARY-BACKUP REPLICATION IN LAB 2

outline:
  simple key/value database
  Get(k), Put(k, v), Append(k, v)
  primary and backup
  replicate by having the primary send each operation to the backup
  tolerate network problems, including partition
    either keep going, correctly
    or suspend operations until the network is repaired
  allow replacement of failed servers
  you implement essentially all of this (unlike lab 1)

"view server" decides who p and b are
  main goal: avoid "split brain" -- disagreement about who the primary is
  clients and servers ask the view server
  they don't make independent decisions

repair: the view server can co-opt an "idle" server as b after the old b becomes p
  the primary initializes the new backup's state

key points:
  1. only one primary at a time!
  2. the primary must have the latest state!
  we will work out some rules to ensure these

view server maintains a sequence of "views"
  view #, primary, backup
   0:  --  --
   1:  S1  --
   2:  S1  S2
   3:  S2  --
   4:  S2  S3

monitors server liveness
  each server periodically sends a Ping RPC
  "dead" if it missed N Pings in a row
  "live" after a single Ping
  there can be more than two servers Pinging the view server
    if more than two, the extras are "idle" servers

if the primary is dead
  new view with the previous backup as primary
if the backup is dead, or there is no backup
  new view with a previously idle server as backup
OK to have a view with just a primary, and no backup
  but -- if an idle server is available, make it the backup

how to ensure the new primary has an up-to-date replica of the state?
  only promote the previous backup
    i.e. don't make an idle server the primary
  the backup must remember whether it has been initialized by the primary
    if not, don't function as primary even if promoted!

Q: can more than one server think it is primary?
   1: S1, S2
      net broken, so the viewserver thinks S1 is dead, but it's alive
   2: S2, --
   now S1 is alive and unaware of view #2, so S1 still thinks it is primary
   AND S2 is alive and thinks it is primary
   => split brain, no good

how to ensure only one server acts as primary?
  even though more than one may *think* it is primary
  "acts as" == executes and responds to client requests
  the basic idea:
    1: S1 S2
    2: S2 --
    S1 still thinks it is primary
    S1 must forward ops to S2
    S2 thinks S2 is primary
      so S2 must reject S1's forwarded ops
  the rules (see the sketch at the end of these notes):
    1. the primary in view i must have been the primary or backup in view i-1
    2. if you think you are the primary, you must wait for the backup on each request
    3. if you think you are not the backup, reject forwarded requests
    4. if you think you are not the primary, reject direct client requests
  so:
    before S2 hears about view #2
      S1 can process ops from clients, and S2 will accept forwarded requests
      S2 will reject ops from clients who have heard about view #2
    after S2 hears about view #2
      if S1 receives a client request, it will forward it, and S2 will reject it
        so S1 can no longer act as primary
      S1 will send an error to the client, the client will ask the view server
        for the new view, and the client will re-send to S2
    the true moment of switch-over occurs when S2 hears about view #2

how can a new backup get the state? e.g. all the keys and values
  if S2 is the backup in view i, but was not in view i-1,
    S2 should ask the primary to transfer the complete state
  rule for state transfer:
    every operation (Put/Get/Append) must be either before or after the state xfer
    == the state xfer must be atomic w.r.t. operations
    either the op is before, and the transferred state reflects the op,
    or the op is after, the transferred state doesn't reflect the op,
      and the primary forwards the op after the state

Q: does the primary need to forward Get()s to the backup?
   after all, Get() doesn't change anything, so why does the backup need to know?
   and the extra RPC costs time

Q: how could we make primary-only Get()s work?

Q: are there cases when the lab 2 protocol cannot make forward progress?
   the view service fails
   the primary fails before a new backup gets the state
   We will start fixing those in lab 3
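As a rough illustration of rules 2-4 (not the lab's actual code): a Go sketch
in which a server rejects direct client requests unless it believes it is the
primary, waits for the backup to accept each forwarded op before replying, and
rejects forwarded ops unless it believes it is the backup. The View fields
mirror the lab's viewservice (view number, primary, backup); PBServer,
ErrWrongServer, and the forward callback are names invented for this sketch.

  package main

  import (
      "errors"
      "fmt"
  )

  // View mirrors the lab's notion of a view: a view number plus the
  // names of the current primary and backup.
  type View struct {
      Viewnum uint
      Primary string
      Backup  string
  }

  var ErrWrongServer = errors.New("wrong server")

  // PBServer is a sketch of a primary/backup key/value server.
  type PBServer struct {
      me   string
      view View              // latest view this server has heard about
      data map[string]string // the replicated key/value state
  }

  // Put handles a request sent directly by a client.
  func (pb *PBServer) Put(key, value string, forward func(string, string) error) error {
      // Rule 4: if we don't think we are the primary, reject the client.
      if pb.view.Primary != pb.me {
          return ErrWrongServer
      }
      // Rule 2: if we have a backup, it must accept the forwarded op
      // before we apply and reply; a rejection means we've been superseded.
      if pb.view.Backup != "" {
          if err := forward(key, value); err != nil {
              return ErrWrongServer
          }
      }
      pb.data[key] = value
      return nil
  }

  // ForwardedPut handles an op forwarded by the (supposed) primary.
  func (pb *PBServer) ForwardedPut(key, value string) error {
      // Rule 3: if we don't think we are the backup, reject the forward.
      if pb.view.Backup != pb.me {
          return ErrWrongServer
      }
      pb.data[key] = value
      return nil
  }

  func main() {
      // The split-brain scenario from the notes: S1 is stuck on view 1,
      // S2 has heard about view 2 and thinks it is the primary.
      s1 := &PBServer{me: "S1", view: View{1, "S1", "S2"}, data: map[string]string{}}
      s2 := &PBServer{me: "S2", view: View{2, "S2", ""}, data: map[string]string{}}

      // S1 must forward to its backup; S2 rejects the forward, so S1
      // cannot act as primary and there is no split brain.
      fmt.Println(s1.Put("x", "1", s2.ForwardedPut))
  }

Real Lab 2 code also has to handle duplicate client retries and the atomic
state transfer to a new backup, which this sketch omits.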