2017 Lecture 7: Spinnaker Case Study

Reading: "Using Paxos to build a scalable, consistent and highly available
datastore", by Rao, Shekita, and Tata, Proc. VLDB, 2011

Why are we reading this paper?
  Case study of using Paxos to build a K/V service
    Same goal as lab 3, with Paxos substituted for Raft
  Authors use Paxos in a way that looks a lot like Raft
  One of the first papers to use Paxos/Raft for data replication
    Instead of for a configuration service a la Chubby/ZooKeeper
  Several tricky issues:
    strong consistency
    exactly-once semantics
    snapshots to shorten the log
    scaling performance

K/V service
  Put(key, value)
  Get(key) -> value
  Spinnaker's interface is more complicated

Setup
  Many clients, issuing Put/Get RPCs
  3 servers
  Ignore Spinnaker's sharding for now

Goal
  Strong consistency
    behave like an unreplicated system
    a Get reads the value of the last completed Put
  Highly available
    survive one failure
    use 3 servers so a majority can handle split brain

Consistency
  Behave like an unreplicated system --- what does that mean?
    What if a Put and a Get execute concurrently on a single-machine K/V
      service?  What must the outcome be?
  There are several definitions of strong consistency
    lab 3 chooses linearizability

Linearizability
  A history h of operations is linearizable if there is a linear order of
  the completed operations such that:
    - every completed operation in h returns the same result as it would
      if the operations were executed one by one in that order;
    - if operation op1 completes (gets a response) before op2 begins (is
      invoked), then op1 precedes op2 in the order.
  Another way of saying the above:
    - h's invocations and responses can be reordered to yield a
      sequential history;
    - that sequential history is correct according to the sequential
      definition of the object;
    - if a response preceded an invocation in the original history, it
      must still precede it in the sequential reordering.
  By example:
    if a Put and a Get start concurrently, either ordering is fine
    if the Put finished before the Get started, the Get must see the
      result of the Put

Sequential consistency
  Another popular definition of strong consistency
  All Puts occur in a total order
  All Puts/Gets must be consistent with each client's order
  Doesn't guarantee that a Get sees the result of a Put that completed
    before the Get in real time:
      C1: Put(k1, v0)    Put(k1, v1)
      C2:                               Get(k1) -> v0
      time ---------->------------>-------------->----
    the above history is sequentially consistent but not linearizable

Time-line consistency
  Puts occur in a total order
  A Get can return the result of any Put
    ==> a Get can return stale data
    but any cohort can respond, so higher availability

Programmability vs. availability
  Strong consistency
    easier to program against
    lower availability
  Weaker consistency
    more difficult to program against
    higher availability
  The labs go for strong consistency (linearizability)

Approach to consistency: exploit a replicated log
  Raft provides us with a consistent, replicated log
    [modify servers to have a log]
  Good building block, but insufficient by itself for a K/V service
  How do we use the replicated log to build the K/V service?

Plan 1 for Put (broken)
  Plan:
    client sends Put to the leader
    leader calls Start()
    leader performs the Put
    leader sends a response back to the client
  Good: the leader orders the Puts
  Bad: the leader responds too soon
    the client thinks the RPC succeeded
    but the server may fail right after sending the response
      => the Put never happens
    must wait until the Put shows up in the log

Plan 2 for Put (slow and incomplete)
  Clients all contact the leader
  Leader inserts all Puts/Gets into the log
  When the leader sees a committed Get(k) in the log, it returns the
    value of the last Put(k) in the log to the client
  The log is a linear history
    Puts that completed before the Get started are in the log before
      the Get
    the Get sees the results of those Puts
  Bad: slow, need to scan the log
    would be nice if we could keep a K/V map
      a Put updates the K/V map
      a Get reads from the K/V map
  Bad: what if the leader fails before entering the op in the log?
    the client never receives a response

Plan 3 for Put (broken)
  Plan:
    client sends Put to the leader
    leader calls Start()
      the operation contains the Put, its args, and the client ID
    Raft replicates the operation
    the Put shows up on applyCh when committed to the log
    all servers perform the Put()s in log order
      Put is deterministic
      all Puts are applied in the same global order
      so the K/V map is consistent across servers
    the leader sends a response back to the client as soon as it applies
      the client's Put
      the client ID is in the operation
  Good: fast
  Bad: no retry

Plan 4 for Put (correct: retry, with duplicate detection)
  Client keeps retrying until it gets a response from the service
    if failure, try another server
  Add a seqno to client requests
    when a Put() succeeds, increase the seqno
  Servers maintain a duplicate-detection table
    when a server executes an op from the log:
      if duplicate[clientid].seqno != op.seqno:
        res = apply op to the K/V map
        duplicate[clientid] = (op.seqno, res)
      if leader: send duplicate[clientid].res to the client
  (Sketches of the client and server sides follow.)
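To make Plan 4 concrete, here is a minimal sketch of the client side in
Go, in the spirit of the lab 3 clerk.  Everything here is illustrative:
the call() helper, the type names, and the error convention are
assumptions for the sketch, not the labs' (or Spinnaker's) actual API.

  package kvclient

  // PutArgs carries the client ID and seqno so the servers can detect
  // duplicate (retried) requests.
  type PutArgs struct {
      Key      string
      Value    string
      ClientID int64
      Seqno    int64
  }

  type PutReply struct {
      Err string // "" on success; e.g. "wrong leader" otherwise
  }

  type Clerk struct {
      servers  []string // network names of the K/V servers
      clientID int64    // unique per client, e.g. a random 64-bit value
      seqno    int64    // seqno of this client's current request
      leader   int      // guess at the index of the current leader
  }

  // call sends one RPC and reports whether a reply arrived; the RPC
  // plumbing itself is elided in this sketch.
  func call(srv, method string, args interface{}, reply interface{}) bool {
      // ... send the RPC, wait with a timeout ...
      return false
  }

  // Put retries until some server replies with success.  The seqno
  // stays the same across retries, so a re-sent request shows up at
  // the service as a recognizable duplicate.
  func (ck *Clerk) Put(key, value string) {
      ck.seqno++
      args := PutArgs{Key: key, Value: value,
          ClientID: ck.clientID, Seqno: ck.seqno}
      for {
          var reply PutReply
          if call(ck.servers[ck.leader], "KVServer.Put", &args, &reply) &&
              reply.Err == "" {
              return // done; duplicate detection made this exactly-once
          }
          // failure or wrong leader: try the next server
          ck.leader = (ck.leader + 1) % len(ck.servers)
      }
  }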
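A matching sketch of the server side: a single "applier" goroutine
reads committed operations from applyCh and uses the duplicate table to
execute each operation at most once.  Again, the names (Op, ApplyMsg,
KVServer) are lab-3-style assumptions, and the plumbing that wakes up
the RPC handler waiting on a Start() index is elided.

  package kvserver

  import "sync"

  type Op struct {
      Kind     string // "Put" or "Get"
      Key      string
      Value    string
      ClientID int64
      Seqno    int64
  }

  // ApplyMsg is what Raft delivers on applyCh for a committed entry.
  type ApplyMsg struct {
      CommandValid bool
      Command      interface{}
      CommandIndex int
  }

  // dupEntry records the last executed request per client; fields are
  // exported so the snapshot sketch later can gob-encode them.
  type dupEntry struct {
      Seqno int64
      Value string // result of that request (for re-replying)
  }

  type KVServer struct {
      mu      sync.Mutex
      data    map[string]string  // the replicated K/V map
      dup     map[int64]dupEntry // duplicate-detection table
      applyCh chan ApplyMsg
  }

  // applier is the only goroutine that reads applyCh, so every replica
  // applies the committed operations in the same (log) order.
  func (kv *KVServer) applier() {
      for msg := range kv.applyCh {
          if !msg.CommandValid {
              continue // e.g. a snapshot; see the snapshot sketch later
          }
          op := msg.Command.(Op)

          kv.mu.Lock()
          if last, seen := kv.dup[op.ClientID]; !seen || last.Seqno != op.Seqno {
              // not a duplicate: execute exactly once
              var res string
              switch op.Kind {
              case "Put":
                  kv.data[op.Key] = op.Value
              case "Get":
                  res = kv.data[op.Key]
              }
              kv.dup[op.ClientID] = dupEntry{Seqno: op.Seqno, Value: res}
          }
          // if this server is the leader, wake up the RPC handler that
          // called Start() for msg.CommandIndex and hand it
          // kv.dup[op.ClientID] (notification plumbing elided)
          kv.mu.Unlock()
      }
  }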
Plan 1 for Get (broken)
  Plan:
    client sends Get to any server
    server performs the Get and replies
  Good: high throughput
    any server can process a Get
  Bad: a server may send a stale value
    the primary committed a new value to the log before the client
      started the Get()
    but a follower hasn't received that log entry yet (maybe partitioned)
    the follower will return the stale value
    not linearizable
      a Get must see all committed operations
    Spinnaker returns a stale value if the client asks for time-line
      consistency

Plan 2 for Get (broken)
  Plan:
    client sends Get to the primary
    primary performs the Get and replies
  Good: the primary knows which ops have been committed to the log
  Bad: the primary can return a stale value anyway
    the primary may be an old primary, perhaps partitioned
    another client may have contacted the new primary and performed a Put
    the old primary returns the stale value
    could not happen in an unreplicated system

Plan 3 for Get
  Simple solution:
    run Get() operations through the log too, like Put()s
    lab 3 does this
  Raft paper:
    don't run Get() through the log, but make sure the leader has the
      last committed operation:
      a new leader commits a no-op entry after becoming leader
      the leader pings the followers to check that it is still the
        leader before processing a Get()

Challenge: the log grows without bound
  Need a way to shorten the log
  Idea: snapshots
    snapshot the K/V table when the log reaches a maximum size
    remember the last log sequence number (LSN) included in the snapshot
    remove all operations through that LSN
  Why is this OK?
    if all servers have consistent snapshots
      a snapshot includes all ops through some log entry (its LSN)
    then on recovery:
      load the last snapshot into memory
      apply the operations in the log after that LSN to the snapshot
  Challenge: a server may not have the latest snapshot
    Raft handles this for us

Engineering challenge: making snapshots
  The K/V map lives above Raft
  The log lives inside Raft
  Solution:
    the K/V layer makes snapshots
    the K/V layer tells Raft about each snapshot, including its LSN
    Raft tells the K/V layer to install snapshots via applyCh
  Why communicate snapshots over applyCh?
    Where do they appear on applyCh?
  Must the snapshot include the duplicate-detection table?
  (A sketch follows.)
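Continuing the server sketch above, a minimal sketch of the snapshot
hand-off.  It assumes a Raft library that exposes a Snapshot(index,
snapshot) call and that delivers leader-installed snapshots on applyCh,
in the style of the course lab framework; the Raftish interface, the
gob encoding, and the maxLogSize knob are all assumptions for this
sketch.  Note that the duplicate-detection table goes into the
snapshot: without it, a server restored from a snapshot could
re-execute a client's retried request.

  package kvserver

  import (
      "bytes"
      "encoding/gob"
  )

  // Raftish is the (assumed) slice of the Raft library's API that the
  // K/V layer needs: hand over a snapshot covering the log through
  // index, so Raft can discard that prefix of the log.
  type Raftish interface {
      Snapshot(index int, snapshot []byte)
  }

  // snapshotIfNeeded serializes the K/V table and the duplicate table
  // and hands them to Raft.  The applier calls it (without holding
  // kv.mu) after applying the log entry at index.
  func (kv *KVServer) snapshotIfNeeded(rf Raftish, index, logSize, maxLogSize int) {
      if logSize < maxLogSize {
          return
      }
      kv.mu.Lock()
      defer kv.mu.Unlock()
      w := new(bytes.Buffer)
      e := gob.NewEncoder(w)
      e.Encode(kv.data) // the K/V table (error handling elided)
      e.Encode(kv.dup)  // the duplicate table must be included too
      rf.Snapshot(index, w.Bytes()) // Raft may now truncate its log
  }

  // installSnapshot replaces the K/V state with a snapshot that Raft
  // delivered on applyCh, e.g. when this server lagged so far behind
  // that the leader sent a snapshot instead of log entries.
  func (kv *KVServer) installSnapshot(snapshot []byte) {
      kv.mu.Lock()
      defer kv.mu.Unlock()
      d := gob.NewDecoder(bytes.NewBuffer(snapshot))
      kv.data = map[string]string{}
      kv.dup = map[int64]dupEntry{}
      d.Decode(&kv.data) // (error handling elided)
      d.Decode(&kv.dup)
  }

One answer to "why over applyCh?": delivering installed snapshots on
the same channel as committed ops gives the K/V layer a single ordered
stream of state changes, so a snapshot cannot race with the operations
it supersedes.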
Engineering challenge: coordination between the K/V server and the Raft
  library
  The K/V server has locks to protect its state
  The Raft library has locks to protect its state
  Upcalls are delivered through applyCh
  Easy to get deadlocks.  For example, on the leader:
    a Raft goroutine holds the Raft lock
      e.g., it has just received a response from a follower
    it sends on applyCh while holding the lock
      the follower's response indicated that a new log entry has been
        committed
      it holds the lock because entries must be sent on applyCh in log
        order
    the goroutine blocks because applyCh is full
    a goroutine in the K/V layer calls into the Raft library, e.g.,
      Start() or the snapshot hand-off
      the library already holds its lock, so the K/V goroutine blocks
    now no goroutine reads from applyCh: deadlock
  Need a plan for managing concurrency
  One possible plan:
    always acquire locks in a specific order
      e.g., the K/V lock first, then the Raft lock
    how to deal with upcalls (calls from Raft to the K/V layer)?
      Raft:
        RPC reply handlers never write to applyCh
        a separate, dedicated goroutine writes to applyCh
          it releases the Raft lock before writing to applyCh
          ordering is still OK, because only that one goroutine sends
            on applyCh
      K/V layer:
        one goroutine reads from applyCh and applies the ops to the
          K/V map
        it releases the K/V lock before reading from applyCh

Performance challenge: performance is limited by the replicated log
  All Put operations must go through the leader
  Each Put operation requires chatter with the followers to replicate it
  Idea: shard the K/V map
    one Raft instance per shard
    operations on different shards can proceed in parallel
      => increases the throughput of the K/V service
  Implementation using a master
    the master coordinates the shards
      stores the mapping from keys to shards
      load-balances keys across shards (lab 4)
    the master is implemented as its own Raft instance

Performance challenge for Spinnaker
  Several cohorts share a single disk
    seeks are slow on magnetic disks
  Solution: share one log between the different cohorts
  Complications: logical truncation, etc.

Summary
  A replicated log is a useful building block
  But it is insufficient by itself for building replicated services
  Need solutions for:
    ensuring strong consistency
    ensuring exactly-once semantics
    truncating the log
    sharding application state

References
  http://www.vldb.org/2011/files/slides/research15/rSession15-1.ppt