6.824 2015 Lecture 8: Harp
==========================

Replication in the Harp File System
Liskov, Ghemawat, Gruber, Johnson, Shrira, Williams
SOSP 1991

Why are we reading this paper?
  Harp was the first complete replication system that dealt w/ partition.
  It's a complete case study of a replicated service (file server).
  It uses Raft-like replication techniques.

How could a 1991 paper still be worth reading?
  Harp introduced techniques that are still widely used.
  There are few papers describing complete replicated systems.
  The paper is a mix of fundamentals and incidentals.
    We care a lot about replication.
    We may not care much about NFS specifically.
    But we care a lot about the challenges faced when integrating a real
      application with a replication protocol, and about where optimization
      is possible.

I'm going to focus on parts of Harp that aren't already present in Raft.
  But note that Harp pre-dates Raft by 20+ years.
  Raft is, to a great extent, a tutorial on ideas pioneered by Harp,
    though they differ in many details.

What does the Harp paper explain that the Raft paper does not?
  Adapting a complex service to the state machine abstraction
    e.g. the possibility of applying an operation twice
  Lots of optimizations
    pipelining of requests to backups
    witnesses
    primary-only execution of read-only operations using leases
    efficient re-integration of a re-started server with large state
  Power failure, including simultaneous failure of all servers
  Efficient persistence on disk
  (Note: the Harp authors had not implemented recovery yet.)

Earlier paper (1988) describes View-stamped Replication:
  http://www.pmg.csail.mit.edu/papers/vr.pdf
A later (2006) paper has a clearer description, though a bit different:
  http://pmg.csail.mit.edu/papers/vr-revisited.pdf

Basic setup is familiar
  Clients, primary, backup(s), witness(es)
  Client -> Primary
  Primary -> Backups
  Backups -> Primary
  Primary waits for all backups / promoted witnesses in the current view
    Commit point
  Primary -> execute and reply to Client
  Primary -> tell Backups to commit

Why are 2b+1 servers necessary to tolerate b failures?
  Requiring a majority for each operation helps ensure that no other
    partition can also be processing operations.
  Majority intersection helps ensure state persists into future views.
  Not waiting for the last b servers allows progress despite b failures.

What are Harp's witnesses?
  The witnesses are one significant difference from Raft.
  The b witnesses do not ordinarily hear about operations or keep state.
  Why is that OK?
    b+1 of the 2b+1 do have state,
    so any b failures leave at least one live copy of the state.
  Why are the b witnesses needed at all?
    If b replicas with state do fail, witnesses give the required b+1 majority.
    To ensure that only one partition operates -- no split brain.
    So, for a 3-server system, the witness is there to break ties about
      which partition is allowed to operate when primary and backup are
      in different partitions. The partition with the witness wins.

Does the primary need to send operations to witnesses?
  The primary must collect ACKs from a majority of the 2b+1 for every
    r/w operation:
    to ensure that it is still the primary -- still in the majority partition,
    and to ensure that the operation is on enough servers to intersect with
      any future majority that forms a new view.
  If all backups are up, primary+backups are enough for that majority.
  If m backups are down:
    The primary must talk to m "promoted" witnesses to get its majority
      for each op.
    Those witnesses must record the op, to ensure overlap with any
      future majority.
    Thus each "promoted" witness keeps a log.
  So in a 2b+1 system, a view always has b+1 servers that the primary
    must contact for each op, and that store each op.
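  A minimal sketch in Go of that per-op flow, with hypothetical names (not
  Harp's actual code): the primary sends each op to its b view members (the
  live backups plus the promoted witnesses) and treats the op as committed
  only once all of them have logged and ACKed it.

    package harp

    // Member is a live backup or a promoted witness in the current view.
    // Both kinds of member log the record before ACKing it.
    type Member interface {
        SendLogRecord(seq int, op []byte)
        AwaitAck(seq int)
    }

    // Primary holds the b other view members that must store and ACK
    // every operation.
    type Primary struct {
        members []Member
        nextSeq int
    }

    // submit sends an op to every view member and returns its sequence
    // number once all of them have ACKed it, i.e. once b+1 of the 2b+1
    // servers (counting the primary itself) store the op -- enough to
    // intersect any future majority that forms a new view.
    func (p *Primary) submit(op []byte) int {
        seq := p.nextSeq
        p.nextSeq++
        for _, m := range p.members {
            m.SendLogRecord(seq, op) // sends can be pipelined; no per-op wait here
        }
        for _, m := range p.members {
            m.AwaitAck(seq) // but commit waits for every member of the view
        }
        return seq // seq is now at or below the commit point
    }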
Note: this is somewhat different from Raft.
  Raft keeps sending each op to all servers, and proceeds when a majority answer,
    so the leader must keep the full log until a failed server re-joins.
  Harp eliminates a failed server from the view and doesn't send to it.
  Only the witness has to keep a big log; it has a special plan (RAM, disk, tape).
  The bigger issue is that it can take a lot of work to bring a re-joining
    replica up to date; careful design is required.

What's the story about the UPS?
  This is one of the most interesting aspects of Harp's design.
  Each server's power cord is plugged into a UPS.
  The UPS has enough battery to run the server for a few minutes.
  The UPS tells the server (via a serial port) when main A/C power fails.
  The server then writes dirty FS blocks and the Harp log to disk, and shuts down.

What does the UPS buy for Harp?
  Efficient protection against A/C power failure of ALL servers.
    For failures of up to b servers, replication is enough.
    If *all* servers failed and lost state, that's more than b failures,
      so Harp has no guarantee (and indeed no state!).
  With the UPS, each server can reply *without* writing to disk,
    yet still guarantee to retain the latest state despite a simultaneous
    power failure.
  But note: the UPS does not protect against other causes of simultaneous
    failure, e.g. bugs, earthquake.
  Harp treats servers that re-start after a UPS-protected crash differently
    from those that re-start after a crash that lost in-memory state,
    because the latter may have forgotten *committed* operations.

Larger point, faced by every fault-tolerant system:
  Replicas *must* keep persistent state to deal w/ failure of all servers:
    committed operations,
    latest view number, proposal number, &c.
  They must persist this state before replying.
  Writing every commit to disk is very slow!
    10 ms per disk write, so only 100 ops/second.
  So there are a few common patterns:
    1. Low throughput
    2. Batching, high delay
    3. Lossy or inconsistent recovery from simultaneous failure
    4. Batteries, flash, SSD w/ capacitor, &c

Let's talk about Harp's log management and operation execution.
  The primary and backups must apply client operations to their state.
  The state here is a file system -- directories, files, owners, permissions, &c.
  Harp must mimic an ordinary NFS server to the client,
    i.e. not forget about ops for which it has sent a reply.

What is in a typical log record?
  The client's NFS operation (write, mkdir, chmod, &c)
  Shadow state: modified i-nodes and directory content *after* execution
  The client's RPC request ID, for duplicate detection
  The reply to send to the client, for duplicate detection

Why does Harp have so many log pointers?
  FP  most recent client request
  CP  commit point (real in the primary, latest heard in a backup)
  AP  highest update sent to the disk
  LB  the disk has finished writing up to here
  GLB all nodes have finished writing to disk up to here
  Why the FP-CP gap?
    So the primary doesn't need to wait for ACKs from each backup before
      sending the next operation to the backups.
    The primary pipelines ops CP..FP to the backups.
    Higher throughput if there are concurrent client requests.
  Why the AP-LB gap?
    Allows Harp to issue many ops as disk writes before waiting for the disk.
    The disk is more efficient if it has lots of writes to choose from
      (e.g. arm scheduling).
  What is the LB?
    This replica has everything <= LB on disk,
      so it won't need those log records again.
  Why the LB-GLB gap?
    GLB is min(all servers' LBs).
    GLB is the earliest record that *some* server might still need
      if it loses memory.
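  A minimal sketch in Go of a log record and the five pointers, with
  hypothetical names and layout (not Harp's implementation); the pointers
  maintain GLB <= LB <= AP <= CP <= FP, and records below the GLB can be
  thrown away.

    package harp

    // Record is one log entry.
    type Record struct {
        Request []byte // the client's NFS operation (write, mkdir, chmod, &c)
        Shadow  []byte // resulting i-node/directory bytes, filled in by the primary
        Reply   []byte // saved reply, kept for duplicate detection
    }

    // Log holds the records plus the five pointers described above.
    type Log struct {
        records map[int]Record // indexed by sequence number
        FP      int            // most recent client request appended
        CP      int            // committed up to here
        AP      int            // issued to the disk up to here
        LB      int            // this node's disk has finished writing up to here
        GLB     int            // min over all nodes' LBs
    }

    // discardOld frees records below the GLB: no server could need them
    // again, even after losing its memory, because every node's disk
    // already reflects them.
    func (l *Log) discardOld() {
        for seq := range l.records {
            if seq < l.GLB {
                delete(l.records, seq)
            }
        }
    }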
When does Harp execute a client operation?
  There are two answers!
  1. When the operation arrives, the primary figures out exactly what should
     happen and produces the resulting on-disk bytes for the modified i-nodes,
     directories, &c. This is the shadow state.
     This happens before the CP, so the primary must consult recent operations
     in the log to find the latest file system state.
  2. After the operation commits, the primary and backups can apply it to
     their file systems. They copy the log entry's shadow state into the file
     system; they do not really execute the operation.
     And only now can the primary reply to the client's RPC.

Why does Harp split execution in this way?
  If a server crashes and reboots, it is brought up to date by replaying
    log entries it might have missed.
  Harp can't know exactly what the last pre-crash operation was,
    so Harp may repeat some.
  It's not correct to fully execute some operations twice, e.g. a file append.
  So Harp log entries contain the *resulting* state, and that is what's applied.
  The point: multiple replay means replication isn't transparent to the service.
    The service must be modified to generate and accept the state
      modifications that result from client operations.
    In general, when applying replication to an existing service, the service
      must be modified to cope with multiple replay.

Can the Harp primary execute read-only operations w/o replicating to the backups?
  e.g. reading a file.
  It would be faster -- after all, there's no new data to replicate.
  What's the danger?
  Harp's idea: leases.
    The backups promise not to form a new view for some amount of time.
    The primary can execute read-only ops locally for that time, minus some slop.
    This depends on reasonably synchronized clocks: the primary and backups
      must have bounded disagreement on how fast time passes.
  What state should the primary use when executing a read-only operation?
    Does it have to wait for all previously arrived operations to commit?
      No! That would be almost as slow as committing the read-only op.
    Should it look at the state as of the operation at FP, i.e. the latest r/w op?
      No! That operation has not committed; it's not allowed to reveal its effects.
    Thus Harp executes read-only ops with the state as of the CP.
  What if a client sends a WRITE and (before the WRITE finishes) a READ
    of the same data?
    The READ may see the data from *before* the WRITE! Why is that OK?

How does failure recovery work?
  I.e. how does Harp recover replicated state during a view change?

Setup for the following scenarios:
  5 servers: 1 is usually the primary, 2+3 are backups, 4+5 are witnesses.

Scenario: S1+S2+S3; then S1 crashes
  S2 is the primary in the new view (and S4 is promoted).
  Will S2 have every committed operation?
  Will S2 have every operation S1 received?
  Will S2's log tail be the same as S3's log tail?
  How far back can S2's and S3's log tails differ?
  How to make S2's and S3's logs the same?
    Must commit the ops that appeared in both S2's and S3's logs
      (see the sketch below).
    What about ops that appear in only one log?
      In this scenario they can be discarded, since they could not have committed.
      But in general a *committed* op might be visible in just one log.
  From what point does a promoted witness have to start keeping a log?
  What if S1 crashed just before replying to a client?
    Will the client ever get a reply?
  After S1 recovers, with an intact disk but lost memory:
    It will be the primary, but Harp can't immediately use its state or log.
    (Unlike Raft, where a leader is only elected if it has the best log.)
    Harp must replay the log from the promoted witness (S4).
  Could S1 have executed an op just before crashing that the replicas
    didn't execute after taking over?
    No: execution happens only up to the CP, and the CP is safe on S2+S3.
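  A sketch in Go of the reconciliation rule in this scenario (hypothetical,
  not Harp's actual view-change code), reusing the Record type from the log
  sketch earlier. It assumes the two surviving logs agree on their common
  prefix, since the old primary sent ops to both backups in the same order.

    // reconcile merges the surviving backups' logs after S1 crashes:
    // ops present in both logs are committed; ops present in only one
    // log are discarded here, since they cannot have reached the old
    // commit point. (In other scenarios a *committed* op may be in just
    // one log, so it cannot always be discarded like this.)
    func reconcile(s2Log, s3Log []Record) (newLog []Record, newCP int) {
        n := len(s2Log)
        if len(s3Log) < n {
            n = len(s3Log)
        }
        newLog = append(newLog, s2Log[:n]...) // the common prefix: safe to commit
        newCP = n - 1                         // new commit point covers the merged log
        return
    }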
New scenario: S2 and S3 are partitioned away (but still alive)
  Can S1+S4+S5 continue to process operations?
    Yes, with promoted witnesses S4+S5.
  Now S4 moves to the S2/S3 partition.
  Can S1+S5 continue?
    No: primary S1 doesn't get enough backup ACKs.
  Can S2+S3+S4 continue?
    Yes: a new view copies log entries S4->S2 and S4->S3, and now S2 is primary.
    Note that the new primary was missing many committed operations.
    In general some *committed* operations may be on only one server.

New scenario: S2 and S3 are partitioned away (but still alive)
  S4 crashes, loses its memory contents, and reboots in the S2/S3 partition.
  Can they continue?
    Only if there wasn't another view that formed and committed more ops.
    How to detect that? It depends on what S4's on-disk view # says.
    It's OK if S4's disk view # is the same as S2+S3's:
      then no new views formed, and S2+S3 must have heard about all
      committed ops in the old view.

New scenario: everybody suffers a power failure
  S4's disk and memory are lost, but it does re-start after repair.
  S1 and S5 never recover.
  S2 and S3 save everything on disk and re-start just fine.
  Can S2+S3+S4 continue? (harder than it looks)
    No: we cannot be sure what state S4 had before the failure.
    It might have formed a new view with S1+S5 and committed some ops.

When can Harp form a new view?
  1. No other view is possible.
  2. Know the view # of the most recent view.
  3. Know all ops from the most recent view.
  #1 is true if you have b+1 of the 2b+1 nodes in the new view.
  #2 is true if you have b+1 nodes that did not lose their view # since
    the last view.
    The view # is stored on disk, so they just have to know their disk is OK.
    One of them *must* have been in the previous view,
      so just take the highest view number.
  And #3?
    You need a disk image, and a log, that together reflect all operations
      through the end of the previous view.
    Perhaps from different servers, e.g. the log from a promoted witness
      and the disk from a backup that failed multiple views ago.

Does Harp have performance benefits?
  In Fig 5-1, why is Harp *faster* than a non-replicated server?
  How much win would we expect from substituting RPCs for disk operations?
  Why graph x=load, y=response-time? Why does this graph make sense?
    Why not just graph the total time to perform X operations?
    One reason is that systems sometimes get more or less efficient under
      high load, and we care a lot how they perform when overloaded.
  Why does response time go up with load?
    Why first gradually?
      Queuing and random bursts?
      And some ops are more expensive than others, causing temporary delays.
    Then almost straight up?
      Probably hard limits, like disk I/Os per second.
      Queue length diverges once offered load > capacity
        (see the toy model below).
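  A toy model (not from the paper) of why the curve is first gradual and
  then nearly vertical: in a simple M/M/1 queue with service capacity mu
  ops/second and offered load lambda, the mean response time is
  1/(mu - lambda). The capacity and load numbers below are made up.

    package main

    import "fmt"

    func main() {
        mu := 100.0 // assumed capacity, e.g. limited by disk I/Os per second
        for _, lambda := range []float64{10, 50, 80, 90, 95, 99} {
            t := 1.0 / (mu - lambda) // mean response time, in seconds
            fmt.Printf("load %3.0f ops/s -> response time %6.1f ms\n", lambda, t*1000)
        }
        // As lambda approaches mu, the denominator goes to zero: the queue,
        // and thus the response time, diverges once offered load nears capacity.
    }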