6.824 2017 Lecture 6: Raft (2)

recall the big picture:
  key/value service as the example, as in Lab 3
  [diagram: clients, k/v layer, k/v table, raft layer, raft log]
  goal: same client-visible behavior as single non-replicated server
    but available despite minority of failed/disconnected servers

*** topic: the Raft log (Lab 2B)

let's talk about synchronizing logs after leader change
  typically due to failure of previous leader

what do we want to ensure?
  if any server executes a given command in a log entry,
    then no server executes something else for that log entry
    (Figure 3's State Machine Safety)
  why? if the servers disagree on the operations, then a change of
    leader might change the client-visible state, which violates our
    goal of mimicking a single server
  example:
    S1: put(k1,v1) | put(k1,v2) | ...
    S2: put(k1,v1) | put(k2,x)  | ...
  can't allow both to execute their 2nd log entries!

how can logs disagree after a crash?
  a leader crashes before sending last AppendEntries to all
    S1: 3
    S2: 3 3
    S3: 3 3
  worse: logs might have different commands in same entry!
    after a series of leader crashes, e.g.
        10 11 12 13  <- log entry #
    S1:  3
    S2:  3  3  4
    S3:  3  3  5

new leader will force its log on followers; example:
  S3 is chosen as new leader for term 6
  S3 sends an AppendEntries with entry 13
    prevLogIndex=12
    prevLogTerm=5
  S2 replies false (AppendEntries step 2)
  S3 decrements nextIndex[S2] to 12
  S3 sends AppendEntries w/ entry 12, prevLogIndex=11, prevLogTerm=3
  S2 deletes its entry 12 (AppendEntries step 3)
  similar story for S1, but S3 has to go back one farther

the result of roll-back:
  each live follower deletes tail of log that differs from leader
  then each live follower accepts leader's entries after that point

could new leader roll back *committed* entries from end of previous term?
  i.e. could a committed entry be missing from the new leader's log?
  this would be a disaster -- old leader might have already said "yes" to a client
  so: Raft needs to ensure elected leader has all committed log entries

why not elect the server with the longest log as leader?
  example:
    S1: 5 6 7
    S2: 5 8
    S3: 5 8
  first, could this scenario happen? how?
    S1 leader in term 6; crash+reboot; leader in term 7; crash and stay down
      both times it crashed after only appending to its own log
    S2 leader in term 8, only S2+S3 alive, then crash
  who should be next leader?
    S1 has the longest log, but entry 8 could have committed !!!
    so new leader can only be one of S2 or S3
    i.e. the rule cannot be simply "longest log"

end of 5.4.1 explains the "election restriction"
  RequestVote handler only votes for a candidate who is "at least as up to date":
    candidate has higher term in last log entry, or
    candidate has same last term and same length or longer log
  so:
    S2 and S3 won't vote for S1
    S2 and S3 will vote for each other
  so only S2 or S3 can be leader, will force S1 to discard 6,7
    ok since 6,7 not on majority -> not committed -> no reply sent to clients
    -> clients will resend the commands in 6,7

the point:
  the "at least as up to date" rule ensures that the new leader's log
    contains all potentially committed entries
  so the new leader won't roll back any committed operation
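
here's a rough Go sketch of that comparison, the kind of check a
RequestVote handler can make before granting a vote; the function and
argument names are mine, not the paper's or the lab skeleton's:

  // "at least as up to date": compare the candidate's last log entry
  // against the voter's last log entry
  func candidateUpToDate(candLastTerm, candLastIndex, myLastTerm, myLastIndex int) bool {
      if candLastTerm != myLastTerm {
          return candLastTerm > myLastTerm // higher term in last entry wins
      }
      return candLastIndex >= myLastIndex // same last term: equal or longer log wins
  }

note the comparison is about the last log entries, not currentTerm; the
usual currentTerm and votedFor checks still apply separately
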
The Question (from last lecture)
  figure 7, top server is dead; which of a/d/f can be elected?
  depending on who is elected leader in Figure 7, different entries
    will end up committed or discarded
    c's 6 and d's 7,7 may be discarded OR committed
  some will always remain committed: 111445566

how to roll back quickly
  the Figure 2 design backs up one entry per RPC -- slow!
  lab tester probably requires faster roll-back
  example:
    S1: 4 5 5      4 4 4      4
    S2: 4 6 6  or  4 6 6  or  4 6 6
    S3: 4 6 6      4 6 6      4 6 6
    S3 is leader for term 6, S1 comes back to life
  paper outlines a scheme towards end of Section 5.3
    no details; here's my guess; better schemes are possible
  if follower rejects, includes this in reply:
    the follower's term in the conflicting entry
    the index of follower's first entry with that term
  if leader knows about the conflicting term:
    move nextIndex[i] back to leader's last entry for the conflicting term
  else:
    move nextIndex[i] back to follower's first index
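
here's one way the follower half of that guess could look in Go;
LogEntry, the 1-indexed log with a dummy entry at index 0, and the
too-short-log case are my assumptions, not the paper's or the lab's:

  type LogEntry struct {
      Term    int
      Command interface{}
  }

  // called by the follower's AppendEntries handler when the entry at
  // prevLogIndex is missing or has a term different from the leader's
  // prevLogTerm; returns what to put in the rejection reply
  func conflictInfo(log []LogEntry, prevLogIndex int) (conflictTerm, conflictIndex int) {
      if prevLogIndex >= len(log) {
          // log too short: no conflicting term to report, so point the
          // leader just past our last entry (a case the guess above
          // doesn't spell out)
          return -1, len(log)
      }
      conflictTerm = log[prevLogIndex].Term
      conflictIndex = prevLogIndex
      for conflictIndex > 1 && log[conflictIndex-1].Term == conflictTerm {
          conflictIndex-- // walk back to the first entry of that term
      }
      return conflictTerm, conflictIndex
  }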
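
and a matching sketch of the leader half, reusing LogEntry from the
sketch above; whether nextIndex should land on the leader's last entry
for the conflicting term or just past it is a judgment call (both
converge), this follows the wording above:

  // given a rejection carrying (conflictTerm, conflictIndex), choose
  // the follower's new nextIndex; leaderLog is 1-indexed like above
  func nextIndexAfterReject(leaderLog []LogEntry, conflictTerm, conflictIndex int) int {
      if conflictTerm != -1 {
          // leader knows about the conflicting term: back up to the
          // leader's last entry for that term
          for i := len(leaderLog) - 1; i >= 1; i-- {
              if leaderLog[i].Term == conflictTerm {
                  return i
              }
          }
      }
      // leader has no entry with that term (or follower's log was too
      // short): back up to the follower's first index for that term
      return conflictIndex
  }

either way the leader skips over a whole term per rejected RPC instead
of a single entry
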
*** topic: persistence (Lab 2C)

what would we like to happen after a server crashes?
  we want it to reboot and re-join, to be ready for the next failure
  we also want to cope with power failure: all servers crash+reboot
  so each server's essential state must persist across crashes

if a server crashes and restarts, what must Raft remember?
  Figure 2 lists "persistent state":
    log[], currentTerm, votedFor
  a Raft server can only re-join after restart if these are intact
  thus it must save them to non-volatile storage
    save after each change
    before sending any RPC or RPC reply
    non-volatile = disk, SSD, &c
  why log[]?
    if a rebooted server was in leader's majority for committing an entry,
      but then forgets, a future leader might not see the committed log entry
  why currentTerm/votedFor?
    to prevent a server from voting for one candidate, then rebooting,
      then voting for a different candidate in the same (or older!) term
    could lead to two leaders for the same term

what if the server crashes while writing its persistent state?

some Raft state is volatile
  commitIndex, lastApplied, next/matchIndex[]
  Raft's algorithms reconstruct them from initial values

persistence is often the bottleneck for performance
  a hard disk write takes 10 ms, an SSD write takes 0.1 ms
  an RPC takes well under 1 ms (within a single data center)
  so we expect 100 to 10,000 ops/second
  can do better by batching many new log entries per disk write

how does the service (e.g. k/v server) recover its state after a crash+reboot?
  easy approach: start with empty state, re-play Raft's entire persisted log
    lastApplied is volatile and starts at zero, so you may need no extra code!
  but re-play will be too slow for a long-lived system
  faster: use a Raft snapshot and replay just the tail of the log
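
here's a minimal Go sketch of saving/restoring that state with
encoding/gob (imports "bytes" and "encoding/gob"); it assumes the lab
skeleton's rf.persister with SaveRaftState/ReadRaftState, so treat
those names, and the Raft field names, as assumptions:

  // call after every change to currentTerm, votedFor, or log[],
  // before sending any RPC or RPC reply
  func (rf *Raft) persist() {
      w := new(bytes.Buffer)
      e := gob.NewEncoder(w)
      e.Encode(rf.currentTerm) // Figure 2's persistent state:
      e.Encode(rf.votedFor)    //   currentTerm, votedFor, log[]
      e.Encode(rf.log)         // (interface-typed commands need gob.Register)
      rf.persister.SaveRaftState(w.Bytes())
  }

  // call once at reboot, before doing anything else
  func (rf *Raft) readPersist(data []byte) {
      if len(data) == 0 {
          return // fresh start, nothing saved yet
      }
      d := gob.NewDecoder(bytes.NewBuffer(data))
      d.Decode(&rf.currentTerm)
      d.Decode(&rf.votedFor)
      d.Decode(&rf.log)
  }

each persist() is the disk write in the 10 ms / 0.1 ms numbers above,
so appending several new entries before one persist() is where the
batching win comes from
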
*** topic: log compaction and Snapshots (Lab 3B)

problem:
  log will get to be huge -- much larger than state-machine state!
  will take a long time to re-play on reboot or send to a new server

luckily:
  1. log contains a lot of redundancy, e.g.
       put(k1,v1) | put(k1,v2) | put(k1,v3) | ...
     so there is a lot of old stuff we don't need in the log
  2. log and service's state are nearly equivalent
     e.g. each server has a copy of the complete key/value table
  a server doesn't need to keep *both* the complete log *and* the service state

what constrains how a server can discard old parts of log?
  can't forget un-committed entries -- might be part of leader's majority
  need to replay committed entries after a crash and restart
  may be needed to bring other servers up to date

solution: service periodically creates persistent "snapshot"
  [diagram: service with state, snapshot on disk, raft log, raft persistent]
  copy of entire state-machine state as of execution of a specific log entry
    e.g. the k/v table
  service writes snapshot to persistent storage (disk)
  service tells Raft it is snapshotted through some log index
  Raft discards log before that index
  a server can create a snapshot and discard prefix of log at any time
    e.g. when log grows too long

relation of snapshot and log
  snapshot reflects only executed log entries
    and thus only committed entries
  so server will only discard committed prefix of log
    anything not known to be committed will remain in log

so a server's on-disk state consists of:
  service's snapshot up to a certain log entry
  Raft's persisted log w/ following log entries
  the combination is equivalent to the full log

what happens on crash+restart?
  service reads snapshot from disk
  Raft reads persisted log from disk
    sends service entries that are committed but not in snapshot

what if a follower lags and leader has discarded past end of follower's log?
  nextIndex[i] will back up to start of leader's log
  so leader can't repair that follower with AppendEntries RPCs
  thus the InstallSnapshot RPC
  (Q: why not have the leader discard only entries that *all* servers have?)

what's in an InstallSnapshot RPC? Figures 12, 13
  term
  lastIncludedIndex
  lastIncludedTerm
  snapshot data

what does a follower do w/ InstallSnapshot?
  reject if term is old (not the current leader)
  reject (ignore) if follower already has last included index/term
    it's an old/delayed RPC
  follower empties its log, replaces with fake "prev" entry
  set lastApplied to lastIncludedIndex
  send snapshot on applyCh to service
    service replaces its k/v table with snapshot contents

note that the state and the operation history are roughly equivalent
  replication protocol designers can choose which to send or store
  e.g. last few operations (log entries) for a lagging replica,
    but the entire state (snapshot) for a replica that has lost its disk
  still, replica repair can be very expensive, and warrants much attention

The Question:
  Could a received InstallSnapshot RPC cause the state machine to go
  backwards in time? That is, could step 8 in Figure 13 cause the state
  machine to be reset so that it reflects fewer executed operations? If
  yes, explain how this could happen. If no, explain why it can't happen.

*** topic: configuration change (not needed for the labs)

configuration change (Section 6)
  configuration = set of servers
  sometimes you need to move to a new set of servers,
    or to increase/decrease the number of servers
  human initiates configuration change, Raft manages it
  we'd like Raft to cope correctly with failure during configuration change

why doesn't a straightforward approach work?
  suppose each server has the list of servers in the current config
  change configuration by telling each server the new list
    using some mechanism outside of Raft
  problem: they will learn the new configuration at different times
  example: want to replace S3 with S4
    S1: 1,2,3  1,2,4
    S2: 1,2,3  1,2,3
    S3: 1,2,3  1,2,3
    S4:        1,2,4
  OOPS! now *two* leaders could be elected!
    S2 and S3 could elect S2
    S1 and S4 could elect S1

Raft configuration change idea:
  "joint consensus" stage that includes *both* old and new configuration
    avoids any time when both old and new can choose a leader independently
  system starts with Cold
  system administrator asks the leader to switch to Cnew
  Raft has special configuration log entries (sets of server addresses)
  each server uses the last configuration in its own log
  1. leader commits Cold,new to a majority of both Cold and Cnew
  2. after Cold,new commits, leader commits Cnew to servers in Cnew

what if leader crashes at various points in this process?
  can we have two leaders for the next term?
  if that could happen, each leader must be one of these:
    A. in Cold, but does not have Cold,new in log
    B. in Cold or Cnew, has Cold,new in log
    C. in Cnew, has Cnew in log
  we know we can't have A/A or C/C, by the usual rules of leader election
  AB? no, since B needs majority from Cold as well as Cnew
  AC? no, since can't proceed to Cnew until Cold,new committed to Cold
  BB? no, since B needs majority from both Cold and Cnew
  BC? no, since B needs majority from Cnew as well as Cold
  good! Raft can switch to a new set of servers w/o risk of two active leaders

*** topic: performance

Note: many situations don't require high performance.
  a key/value store might.
  but a GFS or MapReduce master might not.

Most replication systems have similar common-case performance:
  One RPC exchange and one disk write per agreement.
  So Raft is pretty typical for message complexity.

Raft makes a few design choices that sacrifice performance for simplicity:
  A Raft follower rejects out-of-order AppendEntries RPCs.
    Rather than saving them for use after the hole is filled.
    Might be important if the network re-orders packets a lot.
    And makes leader->follower RPC pipelining harder.
  Snapshotting is wasteful for big, slowly-changing states.
  A slow leader may hurt Raft, e.g. in geo-replication.

Experience suggests these have a big effect on performance:
  Disk writes for persistence.
  Message/packet/RPC overhead.
  Need to execute logged commands sequentially.
  Fast path for read-only operations.

Papers with more attention to performance:
  Zookeeper/ZAB; Paxos Made Live; Harp