6.824 2017 Lecture 3: GFS Case Study

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003

Why are we reading this paper?
  the file system for map/reduce
  main themes of 6.824 show up in this paper
    trading consistency for simplicity and performance
    motivation for subsequent designs
  good systems paper -- details from apps all the way to network
    performance, fault-tolerance, consistency
  influential
    many other systems use GFS (e.g., bigtable)
    HDFS (Hadoop's Distributed File Systems) based on GFS
    
What is consistency?
  A correctness condition
  Important when data is replicated and concurrently accessed by applications 
    if an application performs a write, what will a later read observe?
      what if the read is from a different application?
  Weak consistency
    read() may return stale data  --- not the result of the most recent write
  Strong consistency
    read() always returns the data from the most recent write()
  General trade-off:
    strong consistency is nice for application writers
    strong consistency is bad for performance
  Many correctness conditions (often called consistency models)
    Today first peak; will show up in almost every paper we read this term
    
"Ideal" consistency model
  A replicated files behaves like as a non-replicated file system
    picture: many clients on the same machine accessing files on a single disk
  If one application writes, later reads will observe that write
  What if two application concurrently write to the same file
    In file systems often undefined  --- file may have some mixed content
  What if two application concurrently write to the same directory
    One goes first, the other goes second

Challenges to achieving ideal consistency
  Concurrency
  Machine failures
  Network partitions

Why are these challenges difficult to overcome:
  Requires communication between clients and servers
    May cost performance
  Protocols can become complex --- see next week
    Difficult to implement system correctly
  Many systems in 6.824 don't provide ideal
    GFS is one example

Central challenge in GFS:
  With so many machines failures are common
    assume a machine fails once per year
    w/ 1000 machines, ~3 will fail per day.
  High-performance: many concurrent readers and writers
    Map/Reduce jobs read and store final result in GFS
    Note: *not* the temporary, intermediate files
  Use network efficiently
  These challenges difficult combine with "ideal" consistency

High-level design
  Directories, files, names, open/read/write
    But not POSIX
  100s of Linux chunk servers with disks
    store 64MB chunks (an ordinary Linux file for each chunk)
    each chunk replicated on three servers
    Q: why 3x replication?
    Q: Besides availability of data, what does 3x replication give us?
       load balancing for reads to hot files
       affinity
    Q: why not just store one copy of each file on a RAID'd disk?
       RAID isn't commodity
       Want fault-tolerance for whole machine; not just storage device
    Q: why are the chunks so big?
  GFS master server knows directory hierarchy
    for dir, what files are in it
    for file, knows chunk servers for each 64 MB
    master keeps state in memory
      64 bytes of metadata per each chunk
    master has private recoverable database for metadata
      master can recovery quickly from power failure
    shadow masters that lag a little behind master
      can be promoted to master

Basic file operations:
  Client read:
    send file name and offset to master
    master replies with set of servers that have that chunk
      response includes version # of chunk
      clients cache that information
    ask nearest chunk server
      checks version #
      if version # is wrong, re-contact master
  Client append
    ask master where to store
      maybe master chooses a new set of chunk servers if crossing 64 MB
      master responds with chunk servers and version #
        one chunk server is primary
    Clients pushes data to replicas
      Replicas form a chain
      Chain respects network topology
      Allows fast replication
    Client contacts primary when data is on all chunk servers
      primary assigns sequence number
      primary applies change locally
      primary forwards request to replicas
      primary responds to client after receiving acks from all replicas
    If one replica doesn't respond, client retries
      After contacting master
  Master can appoint new master if master doesn't refresh lease
  Master replicates chunks if number replicas drop below some number
  Master rebalances replicas

Does GFS achieve "ideal" consistency?
  Two cases: directories and files
  Directories: yes, but...
    Yes: strong consistency (only one copy)
    But:
      master may go down and GFS is unavailable
        shadow master can serve read-only operations, which may return stale data
        Q: Why not write operations?
	  split-brain syndrome (see next lecture)
  Files: not always
    Mutations with atomic appends
      A file can have duplicate entries and holes
        if primary fails to contact a replica, the primary reports an error to client
        client retries and primary picks a new offset
	record can be duplicated at two offsets
	while other replicas may have a hole at one offset
    An "unlucky" client can read stale data for short period of time
      A failed mutation leaves chunks inconsistent
        The primary chunk server updated chunk
        But then failed and the replicas are out of date
      A client may read an not-up-to-date chunk
      When client refreshes lease it will learn about new version #
    Mutations without atomic append
      data of several clients maybe intermingled
      concurrent writes on non-replicated Unix can also result in a strange outcome
      if you are, use atomic append or a temporary file and atomically rename

Authors claims weak consistency is not a big problems for apps    
  Most file updates are append-only updates
    Application can use UID in append records to detect duplicates
    Application may just read less data (but not stale data)
  Application can use temporary files and atomic rename
    
Performance (Figure 3)
  huge aggregate throughput for read (3 copies, striping)
    125 MB/sec in aggregate
    Close to saturating network
  writes to different files lower than possible maximum
    authors blame their network stack
    it causes delays in propagating chunks from one replica to next
  concurrent appends to single file
    limited by the server that stores last chunk
    
Summary
  case study of performance, fault-tolerance, consistency
    specialized for MapReduce applications
  what works well in GFS?
    huge sequential reads and writes
    appends
    huge throughput (3 copies, striping)
    fault tolerance of data (3 copies)
  what less well in GFS?
    fault-tolerance of master
    small files (master a bottleneck)
    clients may see stale data
    appends maybe duplicated
    
References
  http://queue.acm.org/detail.cfm?id=1594206  (discussion of gfs evolution)
  http://highscalability.com/blog/2010/9/11/googles-colossus-makes-search-real-time-by-dumping-mapreduce.html