6.5840 2023 Lecture 9: Consistency, Linearizability

today's topic: consistency models, specifically linearizability

what's a consistency model?
  a specification for the relationship of different clients' views of a service
  I'll focus on key/value storage with network clients
    put(k, v)
    get(k) -> v
  given some put/get calls, what outcome(s) are valid?

why might there be any room for debate about behavior?
  [simple diagrams]
  concurrent reads/writes
  replicas
  caches
  failure, recovery
  retransmission, especially across failure

why does a storage system need a consistency model?
  it's hard to write correct applications w/o knowing how storage behaves
    e.g. producer computes, then executes
      put("result", 27)
      put("done", true)
    consumer executes
      while get("done") != true
        pause
      v = get(result)
    is v guaranteed to be 27? or even have a value at all?
  it's hard to design/implement/optimize service w/o a notion of correctness
    e.g. is it OK for clients to read from GFS replicas (rather than primary)?

there are lots of consistency models
  sometimes driven by desire to simplify application programmers' lives
  and sometimes codifying behavior that was convenient for implementors
  also, lots of overlapping definitions from different fields
    e.g. FS, databases, CPU memory
  today: linearizability
  but also:
    eventual consistency
    causal consistency
    fork consistency
    serializability
    sequential consistency
  driving force: performance / convenience / robustness tradeoffs

linearizability
  it's usually what people mean by "strong consistency".
    matches programmer intuitions reasonably well.
    but it effectively rules out many tempting optimizations.
  you'll implement a linearizable key/value store in Lab 3.

starting point
  we assume that there's a serial spec for what individual operations do
  serial = a single server executing operations one at a time
  string db[]
  put(k, v):
    db[k] = v
    return true
  get(k):
    return db[k]
  no surprises here.

but what about concurrent client operations?
  a client sends a request;
    takes some time crossing the network;
    server computes, talks to replicas, &c;
    reply moves through network;
    client receives reply
  other clients may send/receive/be waiting during that time!
  we need a way to describe concurrent scenarios,
    so we can talk about which are/aren't valid

definition: a history
  describes a time-line of possibly-concurrent operations
  each operation has invocation and response times (RPC)
    as well as argument and return values
  example:
    C1: |-Wx1-| |-Wx2-|
    C2:   |---Rx2---|
  the x-axis is real time
  "Wx1" means "write value 1 to record x" -- put(x, 1)
  "Rx1" means "a read of record x yielded value 1" -- get(x) -> 1
  C1 sent put(x, 1), got reply, sent put(x, 2), got reply
  C2 get(x), got reply=2
  the "|-" indicates the time at which client sent request
  the "-|" indicates the time at which the client received the reply
  note that writes have responses; no return value other than "done"

definition: a history is linearizable if
  * you can find a point in time for each operation
    between its invocation and response, and
  * the history's results are the same as serial
    execution in that point order.

example history 1:
  |-Wx1-| |-Wx2-|
    |---Rx2---|
      |-Rx1-|

is this history linearizable?
  can we find a linearization point for each operation?
  we may need to try a few different point assignments.
  this order of points satisfies the rules:
    Wx1 Rx1 Wx2 Rx2
  1. each point lies between invocation and response.
  2. the sequence satisfies the serial put/get spec.

note: either read could have returned either 1 or 2.
  so linearizability often allows multiple different outcomes.

what can we do with the linearizability definition?
  use it to decide if a given history is legal (linearizable).
  only relevant if we have a linearizable service:
    one that aims to always generates linearizable histories.
  for programming: what can I assume / expect as a client?
  for testing: generate requests, check observed history.
  for thinking: could this optimization result in non-linearizable results?

note: the definition is based on external behavior
  so we can apply it without having to know how service works

why is it called "linearizability"?
  the linearization points turn concurrent operations
    into a serial execution -- "linear".
  thus "linearizable" in the sense that it can be converted to a
    linear series of operations.

example 2:
  |-Wx1-| |----Wx2----|
    |--Rx2--|
              |-Rx1-|
we can try a few assignments of linearization points.
  how about Wx1 Wx2 Rx2 Rx1?
  it is not valid because "Wx2 Rx1" doesn't conform to serial spec.
how to show something *isn't* linearizable?
  show that no assignment of points works.
  or that none of the serial orders that conform to the time rule
    conform to the serial execution rule.
  there are only 24 orders in this example, we could try them all.
sometimes you can find a cycle in a "must come before" graph:
  Wx1 before Wx2 (time)
  Wx2 before Rx2 (value)
  Rx2 before Rx1 (time)
  Rx1 before Wx2 (value)
there's a cycle -- so it cannot be turned into a linear order. so this
  history is not linearizable. 
and, if a system could produce this history, we would know that the
  system wasn't linearizable.
one way to look at this is that the Rx2 shows that the
  value has already been updated, so strictly subsequent client
  requests must not see any earlier write.
so, if we want linearizability:
  can't reveal multiple divergent copies to clients.
  can't un-do a revealed write, e.g. if a replica crashes.

GFS is not linearizable: it can produce the example 2 history since
  the Rx1 could come from a replica that hasn't yet been updated.
  if we wanted GFS to be linearizable, client reads would have
    to go through the primary too, and wait for outstanding
    writes to complete.
  similarly, for Lab 3, clients can only read from the leader,
    not followers.

example 3:
|--Wx0--|  |--Wx1--|
            |--Wx2--|
        |-Rx2-| |-Rx1-|
this may look non-linearizable because Wx2 looks
  like it comes after Wx1, but Rx2 comes before Rx1.
but this order shows it's linearizable: Wx0 Wx2 Rx2 Wx1 Rx1
so:
  the service can pick either order for concurrent writes.
  e.g. a key/value server handing concurrent ops to Raft.Start()
  the linear order can be different from start-time order!

example 4:
|--Wx0--|  |--Wx1--|
            |--Wx2--|
C1:     |-Rx2-| |-Rx1-|
C2:     |-Rx1-| |-Rx2-|
can there be a total order?
  C1 needs Wx2 Rx2 Wx1 Rx1
  C2 needs Wx1 Rx1 Wx2 Rx2
  we can't have both Wx2 before Wx1, and Wx2 after Wx1.
  so not linearizable.
so:
  service can choose either order for concurrent writes
  but all clients must see the writes in the same order
  this is important when there are replicas or caches
    they have to all agree on the order in which operations occur

example 5:
|-Wx1-|
        |-Wx2-|
                |-Rx1-|
can order include Wx2 Rx1?
  no: the read doesn't see the latest written value
can order include Rx1 Wx2?
  no: the order has to preserve order if one op ends before another starts
no order is possible -- not linearizable
so:
  reads must return fresh data: stale values aren't linearizable
  even if the reader doesn't know about the write
    the time rule requires reads to yield the latest data
  linearizability forbids many situations:
    split brain (two active leaders)
    forgetting committed writes after a reboot
    reading from lagging replicas

example 6:
C1 calls put(x, 1)
C2 calls put(x, 2)
service receives C1's request;
  network drops response;
  C1's RPC library re-sends request
is it legal for service to execute *both* of C1's request messages?
we then might see this if C3 reads three times:
C1: |--------Wx1---------| (due to retransmission)
C2:        |-Wx2-|
C3:   |-Rx1-| |-Rx2-|  |-Rx1-|
assume x starts out as zero
this history is not linearizable!
so, if we want linearizability:
  duplicate requests from retransmissions must be suppressed!
  the Raft paper mentions this issue (and a solution) in Section 8

example 7:
suppose the service remembers each request+response,
  detect duplicate requests,
  reply with same value as for first copy of the request
  you'll do this in Lab 3.
for writes, this eliminates the duplicate execution -- good.
but for reads, this may yield a saved value that is now out of date!
what does linearizabilty say?
C1: |-Wx3-|          |-Wx4-|
C2:          |-Rx3-------------|
C2's first request before Wx4, re-sends after Wx4
a valid order: Wx3 Rx3 Wx4
so: returning the old saved value 3 is correct
    returning 4 would also be correct

we say a storage system is linearizable if it only ever produces
  linearizable execution histories.
but the definition doesn't tell us how to build such a system.

linearizable systems are not limited to just read and write
  they can support any operation on individual items
  e.g. "mini-transactions"
    atomic read-write sequence on individual items
    test-and-set
    increment
    &c
  works b/c linearizable systems act as if executing ops serially.
  very useful!

clients like linearizability -- it's relatively easy to use:
  * reads see fresh data -- not stale
  * all clients see the same data (when there aren't writes)
  * all clients see data changes in the same order
    so my put(v,27); put(done,true); example works
  * mini-transactions are possible
  the value of these properties will be clearer when we look at
    systems with weaker consistencies.

how can we implement linearizability?
  all solutions have a significant serial component!

single serial server that doesn't crash.
  [diagram: clients, server, op queue, state]
  server picks an order for concurrently arriving client requests.
  executes them in that order, one at a time,
    replies to each before starting the next.
  but what if we want fault tolerance?

primary/backup replication
  [diagram: primary, two backups]
  all requests go to the primary
    picks a serial order
    forwards to backups
    backups execute in the same order
    primary replies to client only after both backups have executed
    so, if client saw response, backup guaranteed to have executed
      important if primary fails
      to avoid forgetting completed requests
  clients cannot send reads directly to a backup, as in GFS
  need an external party to decide when backup should take over
    to avoid split brain
    as in the VMware FT paper

Raft
  peers agree on a serial order for client operations -- the log
  all peers execute all operations in the same order

what about the performance of linearizable systems?
  bad news: if replication, then lots of communication
  bad news: if replication, all replicas must be reachable
  good news: you can shard

what about other consistency models?
  can they allow better performance?
  intuitive semantics?

example: eventual consistency -- a weak model
  multiple copies of the data (e.g. in different datacenters)
  a read consults any one replica (e.g. closest)
  a write updates any one replica (e.g. closest)
    client gets response when that one update is done
  replicas synchronize updates in the background
    eventually, other replicas will see my update

eventual consistency is pretty popular
  faster than linearizability
    especially if replicas are in different cities for fault-tolerance
  and more available -- any one replica will do
    no primary/backup communication or Raft quorum required
  Amazon's Dynamo, and Cassandra, provide eventual consistency

but eventual consistency exposes some anomalies to application programmer:
  * a read may not see the most recent write -- reads can see stale data
    a problem for password change, ACL change, my result/done example
  * writes may be applied out of order
  * different clients may see different data, different orders
  * concurrent writes to same item need to be resolved somehow!
    C1: put(x, 1)
    C2: put(x, 2)
    may initially be applied at different replicas
    only later will they be pushed to other replicas
    how to ensure all replicas choose the same final value?
      so that, eventually, they are identical?
  * eventual consistency cannot support e.g. test-and-set

A general pattern: you can usually choose only one of these:
  Strong consistency
  Maximum availability

But not both.
  Strong consistency makes you wait to update replicas,
    and can't proceed if too many replicas are unavailable.
    Thus poor availability.
  Eventual consistency can proceed even if no other replicas are reachable.
    But has poor consistency.