6.824 2017 Lecture 21: Security: Byzantine Fault Tolerance

Paper: "Practical Byzantine Fault Tolerance" by Castro and Liskov (OSDI '99)

why are we reading this paper?
  impressive result: solves a strictly harder problem than Raft,
    namely state-machine replication in the presence of malicious replicas
  many follow-on papers
  BFT is not widely used today
    people rely on prevention and detection of compromised nodes
  BFT is seeing a comeback for Bitcoin-like systems
    Bitcoin solves consensus with malicious participants,
      but uses proof-of-work and long delays to resolve forks
    Stellar generalizes PBFT for federated deployments
    IBM's Hyperledger uses PBFT

we've considered many fault-tolerance protocols
  they have always assumed "fail-stop" failures -- like power failure
    i.e. servers follow the protocol
  that's hard enough: crash vs. network down; network partitions
  can one handle a larger class of failures?
    buggy servers, that compute incorrectly rather than stopping?
    servers that *don't* follow the protocol?
    servers that have been modified by an attacker?
    often called "Byzantine" faults

the paper's approach:
  replicated state machine
  assumes 2f+1 of 3f+1 replicas are non-faulty
  uses voting to select the right results
  not as easy as it might sound

let's assume the worst case:
  a single attacker controls the f faulty replicas
  and is actively trying to break the system
  if we can handle this, we can handle bugs in f replicas too

what are the attacker's powers?
  supplies the code that the faulty replicas run
  knows the code the non-faulty replicas are running
  knows the faulty replicas' crypto keys
  can read network messages
  can temporarily force messages to be delayed via DoS

what faults *can't* happen?
  no more than f out of 3f+1 replicas can be faulty
  no client failures -- clients never do anything bad
  no guessing of crypto keys or breaking of cryptography

example use scenario:
  RM:
    echo A > grade
    echo B > grade
    tell FK "the grade file is ready"
  FK:
    cat grade
  a faulty system could:
    totally make up the file contents
    execute write("A") but ignore write("B")
    show "B" to RM and "A" to FK
    execute write("B") on only some of the replicas

let's try to design our own Byzantine-fault-tolerant RSM
  start simple (and broken), work towards the paper's design

design 1:
  [client, n servers]
  n servers
  servers don't have the same exploits:
    different OSes, different service implementations, etc.
  client and servers have public-key pairs
  every message is signed by its sender
  recipients authenticate every message
  client sends each request to all n servers
    waits for all n to reply
    authenticates the replies
    only proceeds if all n agree

what's wrong with design 1?
  one server may have been compromised
  that one server can stop progress by disagreeing

design 2: let's have the replicas vote
  2f+1 servers, assume no more than f are faulty
  client waits for f+1 matching replies
    if only f are faulty, and the network works eventually, it must get them!

what's wrong with design 2's 2f+1?
  f+1 matching replies might be f bad nodes and just 1 good one
    so maybe only one good node got the operation!
  the *next* operation also waits for f+1 matching replies
    which might *not* include that one good node that saw op1
  example: (see drawing below)
    f  g1  g2   (f is faulty/bad)
    everyone hears and replies to write("A")
    f and g1 reply to write("B"), but g2 misses it
      the client can't wait for g2, since g2 may be the one faulty server
    f and g2 reply to read(), but g1 misses it
      so the read() yields "A"
  result: client tricked into accepting a reply based on out-of-date state
    e.g. TA reads A instead of B from the grades file
    this breaks correctness; the system should provide linearizability (return B)

  [timeline drawing: client C, faulty server f, good servers g1 and g2;
   Put k=A reaches all three; Put k=B reaches f and g1 but g2 misses it;
   Get k is answered by g2 (which only saw A) and by f (which lies and
   says A too), so C accepts A]

  ==> need to ensure that replies include a majority of the good nodes,
      so that at least one of them will return B later
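  aside: the quorum arithmetic as a quick Go sketch
    (worstGoodOverlap is just an illustrative name, not the paper's code)
    with n replicas, f of them faulty, and quorums of size q, any two quorums
    overlap in at least 2q-n replicas, and in the worst case all f faulty
    replicas sit inside that overlap; design 2's quorums can therefore share
    zero good replicas (the failure drawn above), while design 3 (next)
    enlarges the group so any two quorums always share a good replica

      package main

      import "fmt"

      // worstGoodOverlap returns the minimum number of non-faulty replicas
      // shared by any two quorums of size q out of n replicas, f faulty.
      func worstGoodOverlap(n, q, f int) int {
          good := (2*q - n) - f
          if good < 0 {
              return 0
          }
          return good
      }

      func main() {
          f := 1
          // design 2: n=2f+1, quorum f+1 -> 0 good replicas guaranteed in common,
          // so a read quorum may share only the faulty replica with a write quorum
          fmt.Println(worstGoodOverlap(2*f+1, f+1, f)) // prints 0
          // design 3: n=3f+1, quorum 2f+1 -> at least 1 good replica in common
          fmt.Println(worstGoodOverlap(3*f+1, 2*f+1, f)) // prints 1
      }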
design 3:
  3f+1 servers, of which at most f are faulty
  client waits for 2f+1 matching replies (see the client-side sketch below)
    == f bad nodes plus a majority of the good nodes
    so any two sets of 2f+1 overlap in at least one good node
  example (see below):
    f  g1  g2  g3   (f is faulty/bad)
    everyone hears write("A")
    f, g1, g2 process write("B"); g3 misses it
    now the read() client will wait for 2f+1=3 matching replies
      f and g3 will reply "A"
      g1 and g2 will reply "B"
    the client doesn't know what to believe (neither value has 2f+1 votes)
      but it is guaranteed to see that there's a problem
    so the client can *detect* that some good nodes missed an operation
      we'll see how to repair this in a bit

  [timeline drawing: client C, faulty server f, good servers g1, g2, g3;
   Put k=A reaches all four; Put k=B reaches f, g1, g2 but g3 misses it;
   Get k returns "A" from f and g3 and "B" from g1 and g2, so neither
   value reaches 2f+1]

what about handling multiple clients?
  non-faulty replicas must process operations in the same order!
  let's have a primary pick the order for concurrent client requests
  but we have to worry about a faulty primary

what can a faulty primary do?
  1. send the wrong result to the client
  2. send different ops to different replicas
  3. ignore a client op

general approach to handling a faulty primary:
  1. replicas send results directly to the client
  2. replicas exchange info about the ops sent by the primary
  3. clients notify replicas of each operation, as well as the primary
  each replica watches the progress of each operation
    if no progress, force a change of primary
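  aside: the client side of design 3 as a Go sketch
    (Reply and waitForQuorum are illustrative names, not the paper's code;
     assume signatures are checked before a reply is counted)
    since replicas send results directly to the client, the client just counts
    replies from distinct replicas until some value has a quorum of matching votes

      package main

      import "fmt"

      // Reply is a replica's (already authenticated) answer to an operation.
      type Reply struct {
          Replica int    // sending replica's ID
          Value   string // result of executing the operation
      }

      // waitForQuorum returns the first value reported by `quorum` distinct
      // replicas; ok=false means the reply stream ended with no value reaching
      // a quorum, which the client treats as "problem detected", not an answer.
      func waitForQuorum(replies <-chan Reply, quorum int) (value string, ok bool) {
          voters := map[string]map[int]bool{} // value -> replicas reporting it
          for r := range replies {
              if voters[r.Value] == nil {
                  voters[r.Value] = map[int]bool{}
              }
              voters[r.Value][r.Replica] = true
              if len(voters[r.Value]) >= quorum {
                  return r.Value, true
              }
          }
          return "", false
      }

      func main() {
          f := 1
          ch := make(chan Reply, 4)
          // the Get(k) replies from the design-3 example: the faulty replica and
          // the lagging good replica say "A", the up-to-date good replicas say "B"
          for _, r := range []Reply{{0, "A"}, {1, "B"}, {2, "B"}, {3, "A"}} {
              ch <- r
          }
          close(ch)
          v, ok := waitForQuorum(ch, 2*f+1)
          fmt.Printf("value=%q ok=%v\n", v, ok) // value="" ok=false: split vote detected
      }

    with quorum = 2f+1 this is design 3's client; the split vote from the
    example above comes back as ok=false, i.e. the client detects the problem
    but cannot resolve it on its own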
can a replica execute an operation when it first receives it from the primary?
  no: maybe the primary gave different ops to different replicas
  if we execute before we're sure, we've wrecked the replica's state
  so we need a 2nd round of messages to make sure all good replicas got the same op

design 4 (with a primary):
  3f+1 servers, one is the primary, at most f are faulty, the primary might be faulty
  client sends its request to the primary AND to each replica
  primary chooses the next op and op # n
  primary sends PRE-PREPARE(op, n) to the replicas
  each replica sends PREPARE(op, n) to all replicas
  if a replica gets matching PREPARE(op, n) from 2f+1 replicas (incl. itself)
     and n is the next operation #:
    execute the operation, possibly modifying state
    send a reply to the client
  else:
    keep waiting
  client is happy when it gets f+1 matching replies

  [timeline drawing: client C, primary p, replicas g1, g2, g3;
   C sends op to p and to every replica; p sends PRE-PREPARE(op,n) to the
   replicas; all exchange PREPARE(op,n); after 2f+1 matching PREPAREs each
   replica executes and replies; C waits for f+1 matching replies]

remember our strategy:
  primary follows the protocol => progress
  no progress => replicas detect it and force a change of primary

if the primary is non-faulty, can faulty replicas prevent correct progress?
  they can't forge the primary's msgs
  they can delay msgs, but not forever
  they can do nothing: but they aren't needed for 2f+1 matching PREPAREs
  they can send correct PREPAREs and DoS f good replicas
    to prevent them from hearing ops
    but those replicas will eventually hear the ops from the primary
  worst outcome: delays

if the primary is faulty, will the replicas detect any problem?
  or can the primary cause an undetectable problem?
  the primary can't forge client ops -- they are signed
  it can't ignore client ops -- the client sends them to all replicas
  it can try to send ops in different orders to different replicas,
    or try to trick replicas into thinking an op has been processed
    even though it hasn't
  will the replicas detect such an attack?

results of the primary sending different ops to different replicas:
  case 1: all good nodes get 2f+1 matching PREPAREs
    did they all get the same op?
      yes: everyone who got 2f+1 matching PREPAREs must have gotten the same op,
        since any two sets of 2f+1 share at least one good server
    result: all good nodes will execute the op, client happy
  case 2: >= f+1 good nodes get 2f+1 matching PREPAREs
    again, no disagreement is possible
    result: f+1 good nodes will execute the op, client happy
    BUT up to f good nodes don't execute it
    can they be used to effectively roll back the op?
      i.e. send write("B") to f+1 good nodes, then send read() to the remaining f
      no: the read() won't be able to find 2f+1 replicas with the old state,
        so not enough matching PREPAREs
  case 3: < f+1 good nodes get 2f+1 matching PREPAREs
    result: the client never gets a reply
    result: the system will stop, since f+1 good nodes are stuck waiting for this op

how to resume operation after a faulty primary?
  need a view change to choose a new primary
  (this view change only chooses the primary; there's no notion of a set of live servers)

when does a replica ask for a view change?
  if it sees a client op but doesn't see 2f+1 matching PREPAREs
    after some timeout period

is it OK to trigger a view change if just one replica asks?
  no: faulty replicas might cause constant view changes
  let's defer the question of how many replicas must ask for a view change
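  aside: the view-change trigger as a Go sketch
    (replica and watch are illustrative names, not the paper's code; sending
     the actual VIEW-CHANGE message is stubbed out as a print)
    a replica that hears an op from the client starts a timer; if the op hasn't
    executed when the timer fires, it asks for view v+1 -- whether one such
    request should be enough is exactly the question deferred above

      package main

      import (
          "fmt"
          "time"
      )

      type replica struct {
          view     int
          executed chan string // op IDs, delivered as ops execute locally
      }

      // watch runs once per client request: if the op does not execute before
      // the timeout, the replica gives up on the current primary.
      func (r *replica) watch(opID string, timeout time.Duration) {
          timer := time.NewTimer(timeout)
          defer timer.Stop()
          for {
              select {
              case done := <-r.executed:
                  if done == opID {
                      return // progress was made; no complaint
                  }
              case <-timer.C:
                  fmt.Printf("no progress on %s: send VIEW-CHANGE for view %d\n",
                      opID, r.view+1)
                  return
              }
          }
      }

      func main() {
          r := &replica{view: 0, executed: make(chan string, 1)}
          done := make(chan struct{})
          go func() { r.watch("op-42", 50*time.Millisecond); close(done) }()
          // simulate a faulty primary that never orders op-42: nothing arrives
          // on r.executed, so the watchdog times out and votes for view 1
          <-done
      }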
who is the next primary?
  need to make sure faulty replicas can't always make themselves the next primary
  view number v
  the primary is replica v mod n
  so the primary rotates among the servers
  at most f faulty primaries in a row

view change design 1 (not correct):
  replicas send VIEW-CHANGE requests to the *new* primary
  the new primary waits for enough VIEW-CHANGE requests
  the new primary announces the view change w/ NEW-VIEW
    includes the VIEW-CHANGE requests as proof that enough replicas wanted to change views
  the new primary starts numbering operations at the last n it saw + 1

will all non-faulty replicas agree about operation numbering across the view change?
  problem: I saw 2f+1 PREPAREs for operation n, so I executed it
    the new primary did not, so it did not execute it
    thus the new primary may start numbering at n, yielding two different op #n's
  can the new primary ask all replicas for the set of operations they have executed?
    doesn't work: the new primary can only wait for 2f+1 replies
    faulty replicas may reply, so the new primary may not wait for me

solution: don't execute an operation until we're sure a new primary will hear about it
  add a third phase: PRE-PREPARE, PREPARE, then COMMIT
  only execute after COMMIT

operation protocol:
  client sends op to the primary
  primary sends PRE-PREPARE(op, n) to all
  all send PREPARE(op, n) to all
  after a replica receives 2f+1 matching PREPARE(op, n),
    it sends COMMIT(op, n) to all
  after receiving 2f+1 matching COMMIT(op, n), it executes the op

view change:
  each replica sends the new primary the 2f+1 PREPAREs for its recent ops
  the new primary waits for 2f+1 VIEW-CHANGE requests
  the new primary sends a NEW-VIEW msg to all replicas, with
    the complete set of VIEW-CHANGE msgs
    a list of every op for which some VIEW-CHANGE contained 2f+1 PREPAREs
      i.e. the list of final ops from the last view

if a replica executes an op, will the new primary know of that op?
  the replica only executed after receiving 2f+1 COMMITs
    maybe f of those were lies, from faulty replicas, who won't tell the new primary
    but f+1 COMMITs were from good replicas that got 2f+1 matching PREPAREs
  the new primary waits for VIEW-CHANGE requests from 2f+1 replicas
    ignoring the f faulty nodes, at least f+1 good nodes sent VIEW-CHANGEs
  so f+1 good nodes prepared the op, and f+1 good nodes sent VIEW-CHANGEs
    out of 2f+1 good nodes, those two sets must overlap

can the new primary omit some of the reported recent operations?
  no: NEW-VIEW must include the signed VIEW-CHANGE messages

the paper also discusses:
  checkpoints and logs to help good nodes recover
  various cryptographic optimizations
  optimizations to reduce the # of msgs in the common case
  fast read-only operations

what are the consequences of more than f corrupt servers?
  can the system recover?
what if the client is corrupt?
suppose an attacker can corrupt one of the servers
  e.g. exploits a bug, or steals a password, or has physical access, &c
  why can't the attacker corrupt them all?

References:
  PhD thesis: https://dspace.mit.edu/bitstream/handle/1721.1/86581/48116479-MIT.pdf
  Proactive recovery: http://dl.acm.org/citation.cfm?id=571640
  BASE: http://dl.acm.org/citation.cfm?id=859718
  Stellar: https://www.stellar.org/papers/stellar-consensus-protocol.pdf
  Hyperledger: https://www.zurich.ibm.com/dccl/papers/cachin_dccl.pdf
  A funny perspective on BFT: https://www.usenix.org/system/files/login-logout_1305_mickens.pdf