6.824 2022 Lecture 3: GFS

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003

Why are we reading this paper?
  distributed storage is a key abstraction
    what should the interface/semantics look like?
    how should it work internally?
  GFS paper touches on many themes of 6.824
    parallel performance, fault tolerance, replication, consistency
  good systems paper -- details from apps all the way to network
  successful real-world design
  influential: Hadoop's HDFS

Why is distributed storage hard?
  high performance -> shard data over many servers
  many servers -> constant faults
  fault tolerance -> replication
  replication -> potential inconsistencies
  better consistency -> low performance

What would we like for consistency?
  Ideal model: same behavior as a single server
    ideal server executes client operations one at a time
      even if multiple clients issued operations concurrently
    reads reflect previous writes
      even if server crashes and restarts
    all clients see the same data
  thus: suppose C1 and C2 write concurrently, and after the writes have
    completed, C3 and C4 read. what can they see?
      C1: Wx1
      C2: Wx2
      C3:         Rx?
      C4:         Rx?
    answer: either 1 or 2, but both have to see the same value.
  This is a "strong" consistency model.
  But a single server has poor fault-tolerance.

Replication for fault-tolerance makes strong consistency tricky.
  a simple but broken replication scheme:
    two replica servers, S1 and S2
    clients send writes to both, in parallel
    clients send reads to either
  in our example, C1's and C2's write messages could arrive in
    different orders at the two replicas
    if C3 reads S1, it might see x=1
    if C4 reads S2, it might see x=2
  or what if S1 receives a write, but the client crashes before
    sending the write to S2?
  that's not strong consistency!
  better consistency usually requires communication to ensure
    the replicas stay in sync -- can be slow!
  lots of tradeoffs possible between performance and consistency
    we'll see one today

GFS Context:
  Many Google services needed a big fast unified storage system
    MapReduce, crawler, indexer, log storage/analysis
  Shared among multiple applications, e.g. crawl, index, analyze
  Automatic "sharding" of each file over many servers/disks
    For parallel performance with many clients, e.g. MapReduce
    For huge files too big for one server
  Automatic recovery from failures
  Just one data center per deployment
  Just Google applications/users
  Aimed at sequential access to huge files; read or append
    I.e. not a low-latency DB for small items

What was new about this in 2003? How did they get an SOSP paper accepted?
  Not the basic ideas of distribution, sharding, fault-tolerance.
  Huge scale.
  Used in industry, real-world experience.
  Successful use of weak consistency.

Overall structure
  clients (library, RPC -- but not visible as a UNIX FS)
  coordinator tracks file names
  chunkservers store 64 MB chunks
  big files split into 64 MB chunks, on lots of chunkservers
    big chunks -> low book-keeping overhead
  each chunk replicated on 3 chunkservers
    why 3 rather than 2?

Coordinator state
  tables in RAM (for speed, must be smallish):
    file name -> array of chunk handles (nv)
    chunk handle -> version # (nv)
                    list of chunkservers (v)
                    primary (v)
                    lease time (v)
  "nv" state also written to disk, in case of crash+reboot

What are the steps when client C wants to read a file?
  (a code sketch of the coordinator tables and this read path follows these steps)
  1. C sends filename and offset to coordinator (CO) (if not cached)
  2. CO finds chunk handle for that offset
  3. CO replies with list of chunk handles + chunkservers
     only those with latest version
  4. C caches handle + chunkserver list
  5. C sends request to nearest chunkserver
     chunk handle, offset
  6. chunkserver reads from chunk file on disk, returns to client
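A minimal Go sketch of the coordinator's tables and the client's cached
read path, to make the steps above concrete. All of the names here
(Coordinator, ChunkInfo, Lookup, and so on) are invented for illustration;
the paper does not specify the client library or RPC interface at this level.

    // Sketch of coordinator metadata and client-side caching (hypothetical names).
    package gfs

    import "time"

    type ChunkHandle uint64

    // Per-chunk state. "nv" fields are also logged to disk; "v" fields are
    // volatile and rebuilt by polling chunkservers after a coordinator reboot.
    type ChunkInfo struct {
        Version      uint64    // nv: bumped each time a new primary is designated
        Servers      []string  // v: chunkservers believed to hold the latest version
        Primary      string    // v: current primary, if any
        LeaseExpires time.Time // v: when the primary's lease runs out
    }

    type Coordinator struct {
        files  map[string][]ChunkHandle // nv: file name -> chunk handles, in order
        chunks map[ChunkHandle]*ChunkInfo
    }

    // Lookup maps (file, offset) to a chunk handle plus the chunkservers that
    // hold the latest version -- steps 1-3 of the read path.
    func (co *Coordinator) Lookup(file string, offset int64) (ChunkHandle, []string) {
        handles := co.files[file]
        h := handles[offset/(64<<20)] // 64 MB chunks
        return h, co.chunks[h].Servers
    }

    // The client caches name -> (handle, servers) so repeated reads of the same
    // chunk skip the coordinator -- step 4; steps 5-6 then go to a chunkserver.
    type clientCacheEntry struct {
        handle  ChunkHandle
        servers []string
    }

    type Client struct {
        cache map[string]clientCacheEntry // keyed by file name and chunk index
    }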
Clients only ask the coordinator about file names and lists of chunk handles
  clients cache name -> chunkhandle info
  coordinator does not handle data, so (hopefully) not heavily loaded

How does the coordinator know what chunkservers have a given chunk?

What are the steps when C wants to do a "record append"?
  paper's Figure 2 (a code sketch follows these steps)
  1. C asks CO about file's last chunk
  2. CO tells C the primary and secondaries
  3. C sends data to all (just temporary...), waits for all replies (?)
  4. C tells P to append
  5. P checks that lease hasn't expired, and chunk has space
  6. P picks an offset (at end of chunk)
  7. P writes its own chunk file (a Linux file)
  8. P tells each secondary the offset, tells it to append to its chunk file
  9. P waits for all secondaries to reply, or timeout
     secondary can reply "error", e.g. out of disk space
  10. P tells C "ok" or "error"
  11. C retries from start if error
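A rough Go sketch of the primary's side of record append (steps 5-10 above).
Names and structure are assumptions, not the paper's code; the data-push phase
(step 3) and real error handling are omitted.

    // Hypothetical primary-side append logic.
    package gfs

    import (
        "errors"
        "time"
    )

    const maxChunk = 64 << 20 // 64 MB chunks

    // ReplicaClient stands in for whatever RPC mechanism reaches the secondaries.
    type ReplicaClient interface {
        ApplyAppend(offset int64, data []byte) error
    }

    type Primary struct {
        leaseExpires time.Time
        chunkLen     int64           // current length of the chunk file
        secondaries  []ReplicaClient // the other replicas of this chunk
    }

    // Append: check lease and space, pick the offset, write locally, forward to
    // the secondaries, and report "ok" or "error" back to the client.
    func (p *Primary) Append(data []byte) (int64, error) {
        if time.Now().After(p.leaseExpires) {
            return 0, errors.New("lease expired") // step 5
        }
        if p.chunkLen+int64(len(data)) > maxChunk {
            // simplified: real GFS pads the chunk and has the client
            // retry on the file's next chunk
            return 0, errors.New("chunk full")
        }
        offset := p.chunkLen       // step 6: offset = current end of chunk
        p.writeLocal(offset, data) // step 7: append to the Linux chunk file
        p.chunkLen += int64(len(data))

        // steps 8-9: tell each secondary the chosen offset; on any failure or
        // timeout the client gets "error" and retries from the start, which is
        // why readers may see duplicated or partial records.
        for _, s := range p.secondaries {
            if err := s.ApplyAppend(offset, data); err != nil {
                return 0, err // step 10: "error"
            }
        }
        return offset, nil // step 10: "ok"
    }

    func (p *Primary) writeLocal(offset int64, data []byte) {
        // sketch only: the real chunkserver appends to a file on local disk
    }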
What consistency guarantees does GFS provide to clients?
  Needs to be in a form that tells applications how to use GFS.
  Here's a possibility:
    If the primary tells a client that a record append succeeded, then any
    reader that subsequently opens the file and scans it will see the
    appended record somewhere.
  That allows lots of anomalies: different clients may see records in
    different orders; they may see duplicated records; they may see
    records from failed writes. GFS applications must be prepared!

How can we think about how GFS fulfils the guarantee?
  Look at its handling of various failures:
    crash, crash+reboot, crash+replacement, message loss, partition.
  Ask how GFS ensures the guarantee for each kind of failure.

* What if the client fails during the record append sequence?

* What if the appending client has cached a stale (wrong) primary for a chunk?

* What if the reading client has cached a stale server list for a chunk?

* Could a coordinator crash+reboot cause it to forget about the file?
  Or forget what chunkservers hold the relevant chunk?

* Two clients do record append at exactly the same time.
  Will they overwrite each other's records?

* Suppose one secondary doesn't hear the append command from the primary.
  Due to a temporary network failure.
  What if a reading client reads from that secondary?

* What if primary S1 is alive and serving client requests, but the network
  between the coordinator and S1 fails? ("network partition")
  Will the coordinator pick a new primary?
  Will there now be two primaries?
    So that the append goes to one primary, and the read to the other?
    Thus breaking the consistency guarantee?
    "split brain"

* What if the primary crashes before sending an append to all secondaries?
  Could a secondary that *didn't* see the append be chosen as the new primary?

* Chunkserver S4 with an old, stale copy of a chunk is offline.
  The primary and all live secondaries crash.
  S4 comes back to life (before the primary and secondaries).
  Will the coordinator choose S4 (with the stale chunk) as primary?
  Better to have a primary with ancient data, or no replicas at all?

* How does the coordinator set up a primary for a chunk?
  (a code sketch follows this item)
  If a client wants to write, but there is no primary, or the primary's
    lease expired and it is dead.
  The coordinator has been polling chunkservers about what chunks/versions
    they have.
  a. if no chunkservers w/ latest version #, error
  b. pick primary P and secondaries from those w/ latest version #
  c. increment version #, write to disk
  d. tell P and secondaries who they are, and the new version #
  e. replicas write new version # to disk
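A minimal Go sketch of steps a-e, reusing the Coordinator, ChunkHandle, and
ChunkInfo types from the earlier sketch. The function name and the shape of the
chunkserver reports are assumptions; the paper does not describe this at code
level. The 60-second lease length is from the paper.

    package gfs

    import (
        "errors"
        "time"
    )

    const leaseDuration = 60 * time.Second

    // assignPrimary picks a new primary for chunk h, given the version number
    // each chunkserver reported for that chunk when the coordinator polled it.
    func (co *Coordinator) assignPrimary(h ChunkHandle, reported map[string]uint64) error {
        info := co.chunks[h]

        // a. find the chunkservers holding the latest version we know of
        var upToDate []string
        for server, v := range reported {
            if v == info.Version {
                upToDate = append(upToDate, server)
            }
        }
        if len(upToDate) == 0 {
            return errors.New("no chunkserver has the latest version")
        }

        // b. pick a primary (and secondaries) from the up-to-date replicas
        primary := upToDate[0]

        // c. increment the version number and write it to disk *before* telling
        // anyone, so a crash+reboot can't treat an old version as the latest
        info.Version++
        co.persistVersion(h, info.Version)

        // d./e. tell P and the secondaries who they are and the new version
        // number (they record it on their disks too), and grant the lease
        info.Primary = primary
        info.Servers = upToDate
        info.LeaseExpires = time.Now().Add(leaseDuration)
        return nil
    }

    func (co *Coordinator) persistVersion(h ChunkHandle, v uint64) {
        // sketch only: the real coordinator logs its non-volatile state to disk
    }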
* What should a primary do if a secondary always fails writes?
  e.g. it is dead, or out of disk space, or its disk has broken.
  Keep failing client requests?
  Or ask the coordinator to declare a new set of servers and a new version?
  The paper does not describe this process.

* If there's a partitioned primary serving client appends, and its lease
  expires, and the coordinator picks a new primary, will the new primary
  have the latest data as updated by the partitioned primary?

* What if the coordinator fails altogether?
  Will the replacement know everything the dead coordinator knew?
  E.g. each chunk's version number? primary? lease expiry time?

* Who/what decides the coordinator is dead and must be replaced?
  Could the coordinator replicas ping the coordinator, and take over
  if there is no response?

* What happens if the entire building suffers a power failure?
  And then power is restored, and all servers reboot.

* Suppose the coordinator wants to create a new chunk replica.
  Maybe because there are too few replicas.
  Suppose it's the last chunk in the file, and it is being appended to.
  How does the new replica ensure it doesn't miss any appends?
  After all, it is not yet one of the secondaries.

* Is there *any* circumstance in which GFS will break the guarantee?
  i.e. an append succeeds, but subsequent readers don't see the record.
  All coordinator replicas permanently lose state (permanent disk failure).
    Could be worse: the result will be "no answer", not "incorrect data".
    "fail-stop"
  All chunkservers holding the chunk permanently lose disk content.
    again, fail-stop; not the worst possible outcome
  CPU, RAM, network, or disk yields an incorrect value.
    checksums catch some cases, but not all
  Time is not properly synchronized, so leases don't work out.
    So there are multiple primaries, and maybe the write goes to one,
    the read to the other.

What would it take to have no anomalies -- strict consistency?
  I.e. all clients see the same file content.
  Too hard to give a real answer here, but some issues:
  * All replicas should complete each write, or none.
    Perhaps tentative writes until all promise to complete it?
    Don't expose writes until all have agreed to perform them!
  * The primary should detect duplicate client write requests.
  * If the primary crashes, some replicas may be missing the last few ops.
    The new primary must talk to all replicas to find all recent ops,
    and sync with the secondaries.
  * Clients must be prevented from reading from stale ex-secondaries;
    perhaps secondaries have leases, or clients know about chunk versions
    and get a lease on that version from the coordinator.
  You'll see solutions in Labs 2 and 3!

Performance (Figure 3)
  large aggregate throughput for read
    94 MB/sec total for 16 clients
      or 6 MB/second per client
      is that good?
      one disk's sequential throughput was about 30 MB/s
      one NIC was about 10 MB/s
    close to saturating the inter-switch link's 125 MB/sec
    so: per-client performance is not huge
      but multi-client scalability is good
      which is more important?
    Table 3 reports 500 MB/sec for production GFS, which is a lot
  writes to different files: lower than the possible maximum
    authors blame their network stack (but give no detail)
  concurrent appends to a single file
    limited by the server that stores the last chunk
  hard to interpret after 15 years, e.g. how fast were the disks?

Random issues worth considering
  What would it take to support small files well?
  What would it take to support billions of files?
  Could GFS be used as a wide-area file system?
    With replicas in different cities?
    All replicas in one datacenter is not very fault tolerant!
  How long does GFS take to recover from a failure?
    Of a chunkserver? Of the coordinator?
  How well does GFS cope with slow chunkservers?

Retrospective interview with GFS engineer:
  http://queue.acm.org/detail.cfm?id=1594206
  file count was the biggest problem
    eventual numbers grew to 1000x those in Table 2!
    hard to fit in coordinator RAM
    coordinator scanning of all files/chunks for GC is slow
  1000s of clients
    too much CPU load on the coordinator
  applications had to be designed to cope with GFS semantics and limitations
  coordinator fail-over was initially entirely manual, taking 10s of minutes
  BigTable is one answer to the many-small-files problem
    and Colossus apparently shards coordinator data over many coordinators

Summary
  case study of performance, fault-tolerance, consistency
    specialized for MapReduce applications
  good ideas:
    global cluster file system as universal infrastructure
    separation of naming (coordinator) from storage (chunkserver)
    sharding for parallel throughput
    huge files/chunks to reduce overheads
    primary to sequence writes
    leases to prevent split-brain chunkserver primaries
  not so great:
    single coordinator performance
      ran out of RAM and CPU
    chunkservers not very efficient for small files
    lack of automatic fail-over to a coordinator replica
    maybe consistency was too relaxed