6.824 2017 Lecture 21: Security: Byzantine Fault Tolerance

Paper: "Practical Byzantine Fault Tolerance" by Castro and Liskov (OSDI '99)

why are we reading this paper?
  impressive result: solves a strictly harder problem than Raft,
    namely state-machine replication in the presence of malicious replicas
  many follow-on papers
  BFT is not widely used today
    people rely on prevention and detection of compromised nodes
  BFT is seeing a comeback for Bitcoin-like systems
    Bitcoin solves consensus with malicious participants,
      but uses proof-of-work and long delays to resolve forks
    Stellar generalizes PBFT for federated deployments
    IBM's Hyperledger uses PBFT

we've considered many fault-tolerance protocols
  they have always assumed "fail-stop" failures -- like power failure
    i.e. servers follow the protocol
  that's hard enough: crash vs. network down; network partitions
  can one handle a larger class of failures?
    buggy servers, that compute incorrectly rather than stopping?
    servers that *don't* follow the protocol?
    servers that have been modified by an attacker?
    often called "Byzantine" faults

the paper's approach:
  replicated state machine
  assumes 2f+1 of 3f+1 replicas are non-faulty
  uses voting to select the right results
  not as easy as it might sound

let's assume the worst case:
  a single attacker controls the f faulty replicas
  and is actively trying to break the system
  if we can handle this, we can handle bugs in f replicas too

what are the attacker's powers?
  supplies the code that the faulty replicas run
  knows the code the non-faulty replicas are running
  knows the faulty replicas' crypto keys
  can read network messages
  can temporarily force messages to be delayed via DoS

what faults *can't* happen?
  no more than f out of 3f+1 replicas can be faulty
  no client failures -- clients never do anything bad
  no guessing of crypto keys or breaking of cryptography

example use scenario:
  RM:
    echo A > grade
    echo B > grade
    tell FK "the grade file is ready"
  FK:
    cat grade
  a faulty system could:
    totally make up the file contents
    execute write("A") but ignore write("B")
    show "B" to RM and "A" to FK
    execute write("B") on only some of the replicas

let's try to design our own Byzantine-fault-tolerant RSM
  start simple (and broken), work towards the paper's design

design 1:
  [client, n servers]
  n servers
  servers don't have the same exploits:
    different OSes, different service implementations, etc.
  client and servers have public-key pairs
  every message is signed by its sender
  recipients authenticate every message
  client sends each request to all n servers
    waits for all n to reply
    authenticates the replies
    only proceeds if all n agree

what's wrong with design 1?
  one server may have been compromised
  that one server can stop progress by disagreeing

design 2: let's have the replicas vote
  2f+1 servers, assume no more than f are faulty
  client waits for f+1 matching replies
    if only f are faulty, and the network works eventually, it must get them!

what's wrong with design 2's 2f+1?
  f+1 matching replies might be f bad nodes and just 1 good one
    so maybe only one good node got the operation!
  the *next* operation also waits for f+1 matching replies
    which might *not* include that one good node that saw op1
  example: (see drawing below)
    f  g1  g2   (f is faulty/bad)
    everyone hears and replies to write("A")
    f and g1 reply to write("B"), but g2 misses it
      the client can't wait for g2, since g2 may be the one faulty server
    f and g2 reply to read(), but g1 misses it
      so the read() yields "A"
  result: client tricked into accepting a reply based on out-of-date state
    e.g. TA reads A instead of B from the grades file
    this breaks correctness; the system should provide linearizability (return B)

  [timeline drawing: client C, faulty server f, good servers g1 and g2;
   Put k=A reaches all three; Put k=B reaches f and g1 but g2 misses it;
   Get k is answered by g2 (which only saw A) and by f (which lies and
   says A too), so C accepts A]

  ==> need to ensure that replies include a majority of the good nodes,
      so that at least one of them will return B later
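  aside: the quorum arithmetic as a quick Go sketch
    (worstGoodOverlap is just an illustrative name, not the paper's code)
    with n replicas, f of them faulty, and quorums of size q, any two quorums
    overlap in at least 2q-n replicas, and in the worst case all f faulty
    replicas sit inside that overlap; design 2's quorums can therefore share
    zero good replicas (the failure drawn above), while design 3 (next)
    enlarges the group so any two quorums always share a good replica

      package main

      import "fmt"

      // worstGoodOverlap returns the minimum number of non-faulty replicas
      // shared by any two quorums of size q out of n replicas, f faulty.
      func worstGoodOverlap(n, q, f int) int {
          good := (2*q - n) - f
          if good < 0 {
              return 0
          }
          return good
      }

      func main() {
          f := 1
          // design 2: n=2f+1, quorum f+1 -> 0 good replicas guaranteed in common,
          // so a read quorum may share only the faulty replica with a write quorum
          fmt.Println(worstGoodOverlap(2*f+1, f+1, f)) // prints 0
          // design 3: n=3f+1, quorum 2f+1 -> at least 1 good replica in common
          fmt.Println(worstGoodOverlap(3*f+1, 2*f+1, f)) // prints 1
      }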
design 3:
  3f+1 servers, of which at most f are faulty
  client waits for 2f+1 matching replies (see the client-side sketch below)
    == f bad nodes plus a majority of the good nodes
    so any two sets of 2f+1 overlap in at least one good node
  example (see below):
    f  g1  g2  g3   (f is faulty/bad)
    everyone hears write("A")
    f, g1, g2 process write("B"); g3 misses it
    now the read() client will wait for 2f+1=3 matching replies
      f and g3 will reply "A"
      g1 and g2 will reply "B"
    the client doesn't know what to believe (neither value has 2f+1 votes)
      but it is guaranteed to see that there's a problem
    so the client can *detect* that some good nodes missed an operation
      we'll see how to repair this in a bit

  [timeline drawing: client C, faulty server f, good servers g1, g2, g3;
   Put k=A reaches all four; Put k=B reaches f, g1, g2 but g3 misses it;
   Get k returns "A" from f and g3 and "B" from g1 and g2, so neither
   value reaches 2f+1]

what about handling multiple clients?
  non-faulty replicas must process operations in the same order!
  let's have a primary pick the order for concurrent client requests
  but we have to worry about a faulty primary

what can a faulty primary do?
  1. send the wrong result to the client
  2. send different ops to different replicas
  3. ignore a client op

general approach to handling a faulty primary:
  1. replicas send results directly to the client
  2. replicas exchange info about the ops sent by the primary
  3. clients notify replicas of each operation, as well as the primary
  each replica watches the progress of each operation
    if no progress, force a change of primary
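  aside: the client side of design 3 as a Go sketch
    (Reply and waitForQuorum are illustrative names, not the paper's code;
     assume signatures are checked before a reply is counted)
    since replicas send results directly to the client, the client just counts
    replies from distinct replicas until some value has a quorum of matching votes

      package main

      import "fmt"

      // Reply is a replica's (already authenticated) answer to an operation.
      type Reply struct {
          Replica int    // sending replica's ID
          Value   string // result of executing the operation
      }

      // waitForQuorum returns the first value reported by `quorum` distinct
      // replicas; ok=false means the reply stream ended with no value reaching
      // a quorum, which the client treats as "problem detected", not an answer.
      func waitForQuorum(replies <-chan Reply, quorum int) (value string, ok bool) {
          voters := map[string]map[int]bool{} // value -> replicas reporting it
          for r := range replies {
              if voters[r.Value] == nil {
                  voters[r.Value] = map[int]bool{}
              }
              voters[r.Value][r.Replica] = true
              if len(voters[r.Value]) >= quorum {
                  return r.Value, true
              }
          }
          return "", false
      }

      func main() {
          f := 1
          ch := make(chan Reply, 4)
          // the Get(k) replies from the design-3 example: the faulty replica and
          // the lagging good replica say "A", the up-to-date good replicas say "B"
          for _, r := range []Reply{{0, "A"}, {1, "B"}, {2, "B"}, {3, "A"}} {
              ch <- r
          }
          close(ch)
          v, ok := waitForQuorum(ch, 2*f+1)
          fmt.Printf("value=%q ok=%v\n", v, ok) // value="" ok=false: split vote detected
      }

    with quorum = 2f+1 this is design 3's client; the split vote from the
    example above comes back as ok=false, i.e. the client detects the problem
    but cannot resolve it on its own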
can a replica execute an operation when it first receives it from the primary?
  no: maybe the primary gave different ops to different replicas
  if we execute before we're sure, we've wrecked the replica's state
  so we need a 2nd round of messages to make sure all good replicas got the same op

design 4 (with a primary):
  3f+1 servers, one is the primary, at most f are faulty, the primary might be faulty
  client sends its request to the primary AND to each replica
  primary chooses the next op and op # n
  primary sends PRE-PREPARE(op, n) to the replicas
  each replica sends PREPARE(op, n) to all replicas
  if a replica gets matching PREPARE(op, n) from 2f+1 replicas (incl. itself)
     and n is the next operation #:
    execute the operation, possibly modifying state
    send a reply to the client
  else:
    keep waiting
  client is happy when it gets f+1 matching replies

  [timeline drawing: client C, primary p, replicas g1, g2, g3;
   C sends op to p and to every replica; p sends PRE-PREPARE(op,n) to the
   replicas; all exchange PREPARE(op,n); after 2f+1 matching PREPAREs each
   replica executes and replies; C waits for f+1 matching replies]

remember our strategy:
  primary follows the protocol => progress
  no progress => replicas detect it and force a change of primary

if the primary is non-faulty, can faulty replicas prevent correct progress?
  they can't forge the primary's msgs
  they can delay msgs, but not forever
  they can do nothing: but they aren't needed for 2f+1 matching PREPAREs
  they can send correct PREPAREs and DoS f good replicas
    to prevent them from hearing ops
    but those replicas will eventually hear the ops from the primary
  worst outcome: delays

if the primary is faulty, will the replicas detect any problem?
  or can the primary cause an undetectable problem?
  the primary can't forge client ops -- they are signed
  it can't ignore client ops -- the client sends them to all replicas
  it can try to send ops in different orders to different replicas,
    or try to trick replicas into thinking an op has been processed
    even though it hasn't
  will the replicas detect such an attack?

results of the primary sending different ops to different replicas:
  case 1: all good nodes get 2f+1 matching PREPAREs
    did they all get the same op?
      yes: everyone who got 2f+1 matching PREPAREs must have gotten the same op,
        since any two sets of 2f+1 share at least one good server
    result: all good nodes will execute the op, client happy
  case 2: >= f+1 good nodes get 2f+1 matching PREPAREs
    again, no disagreement is possible
    result: f+1 good nodes will execute the op, client happy
    BUT up to f good nodes don't execute it
    can they be used to effectively roll back the op?
      i.e. send write("B") to f+1 good nodes, then send read() to the remaining f
      no: the read() won't be able to find 2f+1 replicas with the old state,
        so not enough matching PREPAREs
  case 3: < f+1 good nodes get 2f+1 matching PREPAREs
    result: the client never gets a reply
    result: the system will stop, since f+1 good nodes are stuck waiting for this op

how to resume operation after a faulty primary?
  need a view change to choose a new primary
  (this view change only chooses the primary; there's no notion of a set of live servers)

when does a replica ask for a view change?
  if it sees a client op but doesn't see 2f+1 matching PREPAREs
    after some timeout period

is it OK to trigger a view change if just one replica asks?
  no: faulty replicas might cause constant view changes
  let's defer the question of how many replicas must ask for a view change
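  aside: the view-change trigger as a Go sketch
    (replica and watch are illustrative names, not the paper's code; sending
     the actual VIEW-CHANGE message is stubbed out as a print)
    a replica that hears an op from the client starts a timer; if the op hasn't
    executed when the timer fires, it asks for view v+1 -- whether one such
    request should be enough is exactly the question deferred above

      package main

      import (
          "fmt"
          "time"
      )

      type replica struct {
          view     int
          executed chan string // op IDs, delivered as ops execute locally
      }

      // watch runs once per client request: if the op does not execute before
      // the timeout, the replica gives up on the current primary.
      func (r *replica) watch(opID string, timeout time.Duration) {
          timer := time.NewTimer(timeout)
          defer timer.Stop()
          for {
              select {
              case done := <-r.executed:
                  if done == opID {
                      return // progress was made; no complaint
                  }
              case <-timer.C:
                  fmt.Printf("no progress on %s: send VIEW-CHANGE for view %d\n",
                      opID, r.view+1)
                  return
              }
          }
      }

      func main() {
          r := &replica{view: 0, executed: make(chan string, 1)}
          done := make(chan struct{})
          go func() { r.watch("op-42", 50*time.Millisecond); close(done) }()
          // simulate a faulty primary that never orders op-42: nothing arrives
          // on r.executed, so the watchdog times out and votes for view 1
          <-done
      }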
who is the next primary?
  need to make sure faulty replicas can't always make themselves the next primary
  view number v
  the primary is replica v mod n
  so the primary rotates among the servers
  at most f faulty primaries in a row

view change design 1 (not correct):
  replicas send VIEW-CHANGE requests to the *new* primary
  the new primary waits for enough VIEW-CHANGE requests
  the new primary announces the view change w/ NEW-VIEW
    includes the VIEW-CHANGE requests as proof that enough replicas wanted to change views
  the new primary starts numbering operations at the last n it saw + 1

will all non-faulty replicas agree about operation numbering across the view change?
  problem: I saw 2f+1 PREPAREs for operation n, so I executed it
    the new primary did not, so it did not execute it
    thus the new primary may start numbering at n, yielding two different op #n's
  can the new primary ask all replicas for the set of operations they have executed?
    doesn't work: the new primary can only wait for 2f+1 replies
    faulty replicas may reply, so the new primary may not wait for me

solution: don't execute an operation until we're sure a new primary will hear about it
  add a third phase: PRE-PREPARE, PREPARE, then COMMIT
  only execute after COMMIT

operation protocol:
  client sends op to the primary
  primary sends PRE-PREPARE(op, n) to all
  all send PREPARE(op, n) to all
  after a replica receives 2f+1 matching PREPARE(op, n),
    it sends COMMIT(op, n) to all
  after receiving 2f+1 matching COMMIT(op, n), it executes the op

view change:
  each replica sends the new primary the 2f+1 PREPAREs for its recent ops
  the new primary waits for 2f+1 VIEW-CHANGE requests
  the new primary sends a NEW-VIEW msg to all replicas, with
    the complete set of VIEW-CHANGE msgs
    a list of every op for which some VIEW-CHANGE contained 2f+1 PREPAREs
      i.e. the list of final ops from the last view

if a replica executes an op, will the new primary know of that op?
  the replica only executed after receiving 2f+1 COMMITs
    maybe f of those were lies, from faulty replicas, who won't tell the new primary
    but f+1 COMMITs were from good replicas that got 2f+1 matching PREPAREs
  the new primary waits for VIEW-CHANGE requests from 2f+1 replicas
    ignoring the f faulty nodes, at least f+1 good nodes sent VIEW-CHANGEs
  so f+1 good nodes prepared the op, and f+1 good nodes sent VIEW-CHANGEs
    out of 2f+1 good nodes, those two sets must overlap

can the new primary omit some of the reported recent operations?
  no: NEW-VIEW must include the signed VIEW-CHANGE messages

the paper also discusses:
  checkpoints and logs to help good nodes recover
  various cryptographic optimizations
  optimizations to reduce the # of msgs in the common case
  fast read-only operations

what are the consequences of more than f corrupt servers?
  can the system recover?
what if the client is corrupt?
suppose an attacker can corrupt one of the servers
  e.g. exploits a bug, or steals a password, or has physical access, &c
  why can't the attacker corrupt them all?

References:
  PhD thesis: https://dspace.mit.edu/bitstream/handle/1721.1/86581/48116479-MIT.pdf
  Proactive recovery: http://dl.acm.org/citation.cfm?id=571640
  BASE: http://dl.acm.org/citation.cfm?id=859718
  Stellar: https://www.stellar.org/papers/stellar-consensus-protocol.pdf
  Hyperledger: https://www.zurich.ibm.com/dccl/papers/cachin_dccl.pdf
  A funny perspective on BFT: https://www.usenix.org/system/files/login-logout_1305_mickens.pdf