6.824 2015 Lecture 6: Raft

this lecture
  larger topic is fault tolerance via replicated state machines
  Raft -- a much more complete design than straight Paxos

Russ Cox of Google will talk about Go on Thursday
  submit your questions early, so we can get them to Russ

Raft overview
  clients -> leader -> followers -> logs -> execution

Raft vs Paxos?
  our use of Paxos: agrees separately on each client operation
  Raft: agrees on each new leader (and on tail of log)
    agreement not required for most client operations
  Raft is Paxos optimized for log appends (more or less)

why a Raft-style leader?
  no dueling proposers (unless leader fails)
  fewer messages, less complexity (unless leader fails)
  well-defined notion of one log being more complete than another
    simplifies switching leaders (and maybe crash recovery)

what about understandability? you must decide for yourself
  straight Paxos is simpler than Raft
  but straight Paxos is too simple for practical replication
    everyone extends it in their own way
    and ends up with something more or less like Raft
  Paxos+log+leader probably not simpler than Raft
    though presumably depends on which Paxos variant you choose

is more direct use of Paxos (like Lab 3) ever a win?
  i.e. is a Raft-style leader ever a bad idea?
  geographically spread peers:
    a single leader would be far from some clients
    some peers would be slow to other peers (Paxos tolerates lag)

let's start w/ Raft w/ no leader change
  for now, a reliable leader
  followers may be slow or unreachable (but they do not lose state)
  what do we want?
    1. tolerate a minority of failed followers
    2. converge on the same log, since replication requires the same order of execution
    3. execute only when an entry cannot be lost (committed),
       since we cannot easily un-do execution or the reply to the client
  idea for ensuring identical logs:
    leader sends log entry, index, and info about the *previous* entry
    follower can reject (e.g. "I don't have that previous entry!")
    leader backs up for that follower, sends earlier entries
    -> leader forces followers' logs to be identical to the leader's
  idea for execution:
    idea #1 means the leader knows a follower is identical up to some point
    once a majority are identical up to a point,
      leader sends that out as the commit point,
      everyone can execute through that point,
      leader can reply to clients

what to do if the leader crashes?
  other servers time out (no AppendEntries "heart-beats" for a while)
  choose a new leader!
  Raft divides time into terms; most terms have a leader

what are the dangers in the transition to a new leader?
  two leaders
  no leader
  might forget an executed log entry
  logs might end up different (diverge)

leader election first, then log consistency at term boundaries

how to ensure at most one leader in a term?
  (look at Figure 2, RequestVote RPC, and Rules for Servers)
  leader must get votes from a majority of servers
  a server can cast only one vote per term
  thus at most one server can think it has won

why a majority?
  the answer is always the same!
  allows fault tolerance (failure of a minority doesn't impede progress)
  prevents split brain (at most one candidate can get a majority)
  ensures overlap (at least one server in the majority has every previously committed log entry)

could an election fail to choose any leader?
  yes!
  >= 3 candidates split the vote evenly
  or an even # of live servers, two candidates each get half

what happens after an election in which no-one gets a majority?
  timeout, increment term, new election
  higher term takes precedence, candidates for older terms quit
  note: the timeout must be longer than it takes to complete an election!
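
aside: a minimal Go sketch of the election rules above -- one vote per term,
  higher term takes precedence. this is not the lab code; the struct and
  function names are invented, though currentTerm and votedFor follow Figure 2.
  the "at least as up to date" log check (discussed later in these notes) is omitted.

    package main

    import "fmt"

    // RequestVoteArgs carries a candidate's term and identity.
    // (LastLogIndex/LastLogTerm, used by the "at least as up to date"
    // check later in these notes, are omitted.)
    type RequestVoteArgs struct {
        Term        int
        CandidateID int
    }

    // Server holds just the persistent state relevant to voting.
    type Server struct {
        currentTerm int
        votedFor    int // -1 means "no vote cast in this term"
    }

    // handleRequestVote grants at most one vote per term; a higher
    // candidate term replaces the server's current term.
    func (s *Server) handleRequestVote(a RequestVoteArgs) bool {
        if a.Term < s.currentTerm {
            return false // stale candidate
        }
        if a.Term > s.currentTerm {
            s.currentTerm = a.Term // adopt the newer term
            s.votedFor = -1        // and forget any vote from an older term
        }
        if s.votedFor == -1 || s.votedFor == a.CandidateID {
            s.votedFor = a.CandidateID
            return true
        }
        return false // already voted for someone else this term
    }

    func main() {
        s := &Server{votedFor: -1}
        fmt.Println(s.handleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 1})) // true
        fmt.Println(s.handleRequestVote(RequestVoteArgs{Term: 1, CandidateID: 2})) // false: one vote per term
    }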

note: a failed election means some terms may have no leader and no log entries

how does Raft reduce the chances of election failure due to a split vote?
  each server delays a random amount of time before starting its candidacy

why is the random delay useful?
  [diagram of times at which servers' delays expire]
  one server will choose the lowest random delay
  hopefully enough time to elect before the next delay expires

how to choose the random delay range?
  too short: 2nd candidate starts before the first finishes
  too long: system sits idle for too long after a leader fails
  a rough guide:
    suppose it takes 10ms to complete an unopposed election
    and there are five servers
    we want delays to be separated by (say) 20ms
    so random delay from 0 to 100 ms
    plus a few multiples of the leader heartbeat interval

remember this random delay idea!
  it's a classic scheme for decentralized soft election; e.g. Ethernet

Raft's elections follow a common pattern: separation of safety from progress
  *hard* mechanisms ensure < 2 leaders in one term
  problem: elections can fail (e.g. 3-way split)
  solution: always safe to start a new election in a new term
  problem: repeated elections can prevent any work getting done
  solution: *soft* mechanisms reduce the probability of wasted elections
    heartbeat from leader (remind servers not to start an election)
    timeout period (don't start an election too soon)
    random delays (give one leader time to be elected)

what if the old leader isn't aware a new one is elected?
  perhaps b/c the old leader didn't see the election messages
  a new leader means a majority of servers have incremented currentTerm
    so the old leader (w/ old term) can't get a majority for AppendEntries
    though a minority may accept the old leader's log entries...
    so logs may diverge at the end of the old term...

now let's switch topics to data handling at term boundaries

what do we want to ensure?
  each server executes the same client cmds, in the same order
  i.e. if any server executes, then no server executes something else for that log entry
  as long as there's a single leader, we've already seen it makes logs identical
  what about when the leader changes?

what's the danger?
  leader of term 3 crashed while sending AppendEntries
    S1: 3
    S2: 3 3
    S3: 3 3
  S2 and S3 might have executed; does Raft preserve it?
  may be a series of crashes, e.g.
    S1: 3
    S2: 3 3 4
    S3: 3 3 5
  thus different entries for the same index!

roll-back is a big hammer -- forces the leader's log on everyone
  in the above examples, whoever is elected imposes its log on everyone
  example: S3 is chosen as new leader for term 6
    S3 wants to send out a new entry (in term 6)
      AppendEntries says the previous entry must have term 5
    S2 replies false (AppendEntries step 2)
    S3 decrements nextIndex[S2]
    AppendEntries for the term=5 op, saying prev has term=3
    S2 deletes the op from term 4 (AppendEntries step 3)
    (and S1 rejects b/c it doesn't have anything in that entry)

ok, the leader will force its own log on followers
  but that's not enough!
  can roll-back delete an executed entry?

when is a log entry executed?
  when the leader advances commitIndex/leaderCommit
  when a majority match the leader up through this point

could a new leader roll back executed entries from the end of the previous term?
  i.e. could an executed entry be missing from the new leader's log?
  Raft needs to ensure the new leader's log contains every potentially executed entry
  i.e. must forbid election of a server that might be missing an executed entry

what are the election rules?
  Figure 2 says only vote if the candidate's log is "at least as up to date"
  so the leader will be at least as up to date as a majority

what does "at least as up to date" mean?
  could it mean log is >= length?
  no; example:
    S1: 5 6 7
    S2: 5 8
    S3: 5 8
  first, could this scenario happen? how?
    S1 leader in term 6; crash+reboot; leader in term 7; crash and stay down
      both times it crashed after only appending to its own log
    S2 leader in term 8, only S2+S3 alive, then crash
  who should be the next leader?
    S1 has the longest log, but entry 8 is committed !!!
      Raft adopts the leader's log, so S1 as leader -> un-commit entry 8
      incorrect, since S2 may have replied to the client
    so the new leader can only be one of S2 or S3
    i.e. the rule cannot be simply "longest log"
  end of 5.4.1 explains the "at least as up to date" voting rule
    compare the last entry
    higher term wins
    if equal terms, longer log wins
  so:
    S1 can't get any vote from S2 or S3, since 7 < 8
    S1 will vote for either S2 or S3, since 8 > 7
    S1's operations from terms 6 and 7 will be discarded!
      ok since no majority -> not executed -> no client reply
  the point: the "at least as up to date" rule causes the new leader to
    have all executed entries in its log
    so the new leader won't roll back any executed operation
    similar to Paxos: a new round ends up using the chosen value (if any) of the previous round

The Question:
  figure 7, which of a/d/f could be elected?
  i.e. get a majority of votes from "less up to date" servers?

the most subtle thing about Raft (figure 8)
  not 100% true that a log entry on a majority is committed
    i.e. will never be forgotten
  figure 8 describes an exception:
    S1: 1 2 4
    S2: 1 2
    S3: 1 2
    S4: 1
    S5: 1 3
  S1 was leader in term 2, sends out two copies of 2
  S5 leader in term 3
  S1 leader again in term 4, sends one more copy of 2 (b/c S3 rejected op 4)
  what if S5 now becomes leader?
    S5 can get a majority (w/o S1)
    S5 will roll back 2 and replace it with 3
  could 2 have executed? it is on a majority...
    so could S1 have mentioned it in leaderCommit after the majority?
    no! the very end of Figure 2 says "log[N].term == currentTerm"
    and S1 was in term 4 when sending the 3rd copy of 2
  what's Raft's actual commit point?
    bottom-right of page 310:
      "committed once the leader that created the entry has replicated it on a majority"
    and the commit point of one entry commits all before it
      which is how 2 *could* have committed if S1 hadn't lost leadership

another topic: configuration change (Section 6)
  configuration = set of servers
  how does Raft change the set of servers?
    e.g. every few years might want to retire some, add some
    or move all at once to an entirely new set of servers
    or increase/decrease the number of servers

how might a *broken* configuration change work?
  each server has the list of servers in the current config
  change configuration by changing the lists, one by one
  example: want to replace S3 with S4
    S1: 1,2,3  1,2,4
    S2: 1,2,3  1,2,3
    S3: 1,2,3  1,2,3
    S4: 1,2,4  1,2,4
  OOPS! now *two* disjoint groups/leaders can form:
    S2 and S3 (who know nothing of the new config)
    S1 and S4
  both can process client requests, so split brain

Raft configuration change idea: "joint consensus"
  a stage that includes *both* old and new configurations
  the leader of the old group logs an entry that switches to joint consensus
  during joint consensus, the leader separately logs in old and new, i.e.
    *two* logs and *two* agreements on each log entry
  this will force new servers to catch up,
    and force new and old logs to be the same
  after a majority of old and new have switched to joint consensus,
    the leader logs an entry that switches to the final configuration
    S1: 1,2,3  1,2,3+1,2,4
    S2: 1,2,3
    S3: 1,2,3
    S4: 1,2,3+1,2,4
  if there's a crash but the new leader didn't see the switch to joint consensus,
    then the old group will continue, no switch, but that's OK
  if there's a crash and the new leader did see the switch to joint consensus,
    it will complete the configuration change

performance
  no numbers on how fast it can process requests
  what are the bottlenecks likely to be?
  disk:
    need to write the disk for client data durability, and for protocol promises
    a write per client request? so 100 per second?
    could probably batch and get 10,000 to 100,000
      (see the batching sketch at the end of these notes)
  net: a few message exchanges per client request
    10s of microseconds for a local LAN message exchange?
    so 100,000 per second?

Thursday: Russ Cox of Google on Go
next week: use of a Raft-like protocol in a complex application
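
appendix: a rough Go sketch of the batching idea from the performance section
  above -- accumulate client ops that arrive close together and hand them to
  one (simulated) disk write, so a single write/fsync is amortized over many
  requests. the names (Entry, appendBatched, persist) are invented for
  illustration; this is not from the paper or the labs.

    package main

    import (
        "fmt"
        "time"
    )

    // Entry stands in for one client operation to be appended to the log.
    type Entry struct{ Op string }

    // appendBatched drains ops that arrive within a short window and hands
    // them to persist() as one batch: one disk write per batch rather than
    // one per client request.
    func appendBatched(in <-chan Entry, persist func([]Entry)) {
        for e := range in {
            batch := []Entry{e}
            window := time.After(1 * time.Millisecond)
        gather:
            for {
                select {
                case e2, ok := <-in:
                    if !ok {
                        break gather // channel closed; flush what we have
                    }
                    batch = append(batch, e2)
                case <-window:
                    break gather // window expired; flush
                }
            }
            persist(batch)
        }
    }

    func main() {
        in := make(chan Entry, 100)
        for i := 0; i < 10; i++ {
            in <- Entry{Op: fmt.Sprintf("put %d", i)}
        }
        close(in)
        appendBatched(in, func(b []Entry) {
            fmt.Printf("one disk write for a batch of %d entries\n", len(b))
        })
    }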