6.824 2014 Lecture 3: GFS Case Study

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003

Why are we reading this paper?
  the file system for map/reduce
  case study of handling storage failures
    trading consistency for simplicity and performance
  motivation for subsequent designs
  good performance -- great parallel I/O performance
  good systems paper -- details from apps all the way down to the network
  all main themes of 6.824 show up in this paper
    performance, fault-tolerance, consistency

What is consistency?
  A correctness condition
  Important when data is replicated and concurrently accessed by applications
    if an application performs a write, what will a later read observe?
    what if the read is from a different application?
  Weak consistency
    read() may return stale data --- not the result of the most recent write
  Strong consistency
    read() always returns the data from the most recent write()
  General trade-off:
    strong consistency is nice for application writers
    strong consistency is bad for performance
  Many correctness conditions (often called consistency models)

History of consistency models
  Much independent development in the architecture, systems, and database communities
    concurrent processors with private caches accessing a shared memory
    concurrent clients accessing a distributed file system
    concurrent transactions on a distributed database
  Many different models with different trade-offs
    serializability
    sequential consistency
    linearizability
    entry consistency
    release consistency
    ....
  Today is a first peek; consistency will show up in almost every paper we read this term

"Ideal" consistency model
  A replicated file system behaves like a non-replicated file system
    picture: many clients on the same machine accessing files on a single disk
  If one application writes, later reads will observe that write
  What if two applications concurrently write to the same file?
    in file systems this is often undefined --- the file may end up with mixed content
  What if two applications concurrently write to the same directory?
    one goes first, the other goes second

Sources of inconsistency
  Concurrency
  Machine failures
  Network partitions
  Example from the GFS paper: primary is partitioned from backup B
    client appends 1
    primary sends 1 to itself and backup A
    primary cannot reach B, so it reports failure to the client
    meanwhile client 2 may read from backup B and observe the old value
    (a toy sketch of this stale-read scenario appears below, after "What does GFS store?")

Why is the ideal difficult to achieve in a distributed file system?
  Protocols can become complex --- see next week
    difficult to implement the system correctly
  Protocols require communication between clients and servers
    may cost performance

GFS designers give up on the ideal to get better performance and a simpler design
  Can make the life of application developers harder
    applications observe behaviors that could not occur in an ideal system
      e.g., reading stale data
      e.g., duplicate append records
    but the data isn't your bank account, so maybe that's ok

Today's paper is an example of the struggle between:
  consistency
  fault-tolerance
  performance
  simplicity of design

GFS goal
  create a shared file system
  hundreds or thousands of (commodity, Linux-based) physical machines
  to enable storing massive data sets

What does GFS store?
  authors don't actually say; guesses for 2003:
    search indexes & databases
    all the HTML files on the web
    all the images on the web
    ...
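Aside: a toy Go sketch (not GFS code; the replica type and names are invented for
illustration) of the partitioned-backup example above -- a write reaches the primary
and backup A but not the partitioned backup B, so a reader that contacts B sees
stale data:

package main

import "fmt"

// replica is a trivial in-memory stand-in for a chunk replica.
type replica struct {
	name string
	data []int
}

func (r *replica) append(v int) { r.data = append(r.data, v) }
func (r *replica) read() []int  { return r.data }

func main() {
	primary := &replica{name: "primary"}
	backupA := &replica{name: "backup A"}
	backupB := &replica{name: "backup B"} // partitioned: never receives the append

	// Client 1 appends 1; the primary applies it locally and forwards it to
	// backup A, but the message to backup B is lost to the partition.
	for _, r := range []*replica{primary, backupA} {
		r.append(1)
	}

	// Client 2 happens to read from backup B and observes the old (empty)
	// value: weak consistency in action.
	fmt.Println("read from", backupB.name, "->", backupB.read()) // []
	fmt.Println("read from", primary.name, "->", primary.read()) // [1]
}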
Properties of files:
  Multi-terabyte data sets
  Many of the files are large
    authors suggest 1M files x 100 MB = 100 TB, but that was in 2003
  Files are generally append-only

Central challenge:
  With so many machines, failures are common
    assume a machine fails once per year
    w/ 1000 machines, ~3 will fail per day
  High performance: many concurrent readers and writers
    Map/Reduce jobs read from and store final results in GFS
    note: *not* the temporary, intermediate files
  Use the network efficiently

High-level design
  Directories, files, names, open/read/write
    but not POSIX
  100s of Linux chunk servers with disks
    store 64MB chunks (an ordinary Linux file for each chunk)
    each chunk replicated on three servers
    Q: why 3x replication?
    Q: besides availability of data, what does 3x replication give us?
       load balancing for reads to hot files
       affinity
    Q: why not just store one copy of each file on a RAID'd disk?
       RAID isn't commodity
       want fault-tolerance for the whole machine, not just the storage device
    Q: why are the chunks so big?
  GFS master server
    knows the directory hierarchy
      for a dir, what files are in it
      for a file, knows the chunk servers for each 64 MB chunk
    master keeps its state in memory
      64 bytes of metadata per chunk
    master has a private recoverable database for metadata
      master can recover quickly from a power failure
    shadow masters lag a little behind the master
      can be promoted to master

Basic operation
  client read:
    send file name and offset to the master
    master replies with the set of servers that have that chunk
    client caches that information for a little while
    client asks the nearest chunk server
  client write:
    ask the master where to store
      maybe the master chooses a new set of chunk servers, if crossing 64 MB
    one chunk server is the primary
    it chooses the order of updates and forwards them to the two backups

Two different fault-tolerance plans
  One for the master
  One for chunk servers

Master fault tolerance
  Single master
    clients always talk to the master
    master orders all operations
  Stores limited information persistently
    name spaces (directories)
    file-to-chunk mappings
  Logs changes to these two in a log
    log is replicated on several backups
    client operations that modify state return *after* recording the changes in the *logs*
    logs play a central role in many systems we will read about
    logs play a central role in the labs
  Limiting the size of the log
    make a checkpoint of the master state
    remove all operations from the log from before the checkpoint
    checkpoint is replicated to backups
  Recovery
    replay the log starting from the last checkpoint
    chunk location information is recreated by asking the chunk servers
  Master is a single point of failure
    recovery is fast, because the master state is small
      so maybe unavailable only for a short time
    shadow masters lag behind the master
      they replay from the replicated log
      can serve read-only operations, but may return stale data
    if the master cannot recover, a master is started somewhere else
      must be done with great care to avoid two masters
    we will see schemes with stronger guarantees, but they are more complex
      see the next few lectures

Chunk fault tolerance
  Master grants a chunk lease to one of the replicas
    that replica is the primary chunk server
    the primary determines the order of operations
  Client pushes data to the replicas
    replicas form a chain
    chain respects the network topology
    allows fast replication
  Client sends the write request to the primary (sketch of this path below)
    primary assigns a sequence number
    primary applies the change locally
    primary forwards the request to the replicas
    primary responds to the client after receiving acks from all replicas
  If one replica doesn't respond, the client retries
  Master re-replicates chunks if the number of replicas drops below some threshold
  Master rebalances replicas
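A minimal, in-process Go sketch of the chunk write path just described: the client
has already pushed the data to all replicas; it then sends the write request to the
primary, which assigns a sequence number, applies the mutation locally, forwards it
to the secondaries, and replies only after all of them ack. Real GFS uses RPC over
the network; here plain method calls stand in for RPCs, and all names are invented.

package main

import "fmt"

type mutation struct {
	seq  int    // order chosen by the primary
	data []byte // data the client pushed earlier
}

// chunkServer applies mutations; in this sketch we trust the primary to
// call apply in sequence-number order.
type chunkServer struct {
	name  string
	chunk []byte
}

func (cs *chunkServer) apply(m mutation) bool {
	cs.chunk = append(cs.chunk, m.data...)
	return true // ack
}

// primary holds the lease for this chunk and orders all mutations.
type primary struct {
	chunkServer
	nextSeq     int
	secondaries []*chunkServer
}

// write is the "write request" from the client, sent after the data push.
func (p *primary) write(data []byte) error {
	m := mutation{seq: p.nextSeq, data: data}
	p.nextSeq++

	// 1. apply the change locally
	p.apply(m)

	// 2. forward the request (with its sequence number) to the secondaries
	//    and wait for their acks
	for _, s := range p.secondaries {
		if !s.apply(m) {
			// if a replica doesn't ack, report an error; the client will
			// retry, which is why record append is at-least-once
			return fmt.Errorf("replica %s did not ack", s.name)
		}
	}

	// 3. only now reply success to the client
	return nil
}

func main() {
	s1 := &chunkServer{name: "secondary-1"}
	s2 := &chunkServer{name: "secondary-2"}
	p := &primary{chunkServer: chunkServer{name: "primary"}, secondaries: []*chunkServer{s1, s2}}

	if err := p.write([]byte("record-1")); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("write ok; primary chunk =", string(p.chunk))
}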
Consistency of chunks
  Some chunks may get out of date
    they miss mutations
  Detect stale data with a chunk version number
    before handing out a lease, the master increments the chunk version number
    sends it to the primary and backup chunk servers
    master and chunk servers store the version persistently
  Version number is also sent to the client
    version number allows the master and the client to detect stale replicas

Concurrent writes/appends
  Clients may write to the same region of a file concurrently
    the result is some mix of those writes -- no guarantees
    few applications do this anyway, so it is fine
    concurrent writes on Unix can also result in a strange outcome
  Many clients may want to append concurrently to, e.g., a log file
    GFS supports atomic, at-least-once append
    the primary chunk server chooses the offset where to append a record
      and sends it to all replicas
    if it fails to contact a replica, the primary reports an error to the client
    client retries; if the retry succeeds:
      some replicas will have the append twice (the ones that succeeded the first time)
    the file may have a "hole" too
      GFS pads to the chunk boundary if an append would cross a chunk boundary

Consistency model
  Strong consistency for directory operations
    master performs changes to metadata atomically
    directory operations follow the "ideal"
    but, when the master is off-line, only shadow masters are available
      read-only operations only, which may return stale data
  Weak consistency for chunk operations
    a failed mutation leaves chunks inconsistent
      the primary chunk server updated its chunk
      but then failed, and the replicas are out of date
    a client may read a not-up-to-date chunk
      when the client refreshes its lease it will learn about the new version #
  Authors claim weak consistency is not a big problem for apps
    most file updates are append-only updates
      application can use a UID in append records to detect duplicates
        (see the dedup sketch at the end of these notes)
      application may just read less data (but not stale data)
    application can use temporary files and atomic rename

Performance (Figure 3)
  huge aggregate throughput for reads (3 copies, striping)
    125 MB/sec in aggregate
    close to saturating the network
  writes to different files lower than the possible maximum
    authors blame their network stack
    it causes delays in propagating chunks from one replica to the next
  concurrent appends to a single file
    limited by the server that stores the last chunk

Summary
  Important fault-tolerance techniques used by GFS
    logging & checkpointing
    primary-backup replication for chunks
      but with weaker consistency
    we will see these in many other systems
  What works well in GFS?
    huge sequential reads and writes
    appends
    huge throughput (3 copies, striping)
    fault tolerance of data (3 copies)
  What works less well in GFS?
    fault-tolerance of the master
    small files (the master becomes a bottleneck)
    concurrent updates to the same file from many clients (except appends)
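Appendix: a small Go sketch (not from the paper; record format and names are
invented) of the application-level trick mentioned under the consistency model:
because GFS record append is at-least-once, a retried append may leave the same
record in the file twice, so the writer tags each record with a unique id and the
reader skips ids it has already seen.

package main

import "fmt"

type record struct {
	uid  string // unique id chosen by the writer (e.g., producer id + counter)
	body string
}

// dedup drops records whose uid has already been seen, i.e. duplicates
// left behind by retried appends.
func dedup(recs []record) []record {
	seen := make(map[string]bool)
	var out []record
	for _, r := range recs {
		if seen[r.uid] {
			continue // duplicate from a retried append
		}
		seen[r.uid] = true
		out = append(out, r)
	}
	return out
}

func main() {
	// Pretend this is what a reader got back from a GFS file: the append of
	// uid "w1-2" succeeded at some replicas on its first try, so the retry
	// produced a second copy of the record.
	file := []record{
		{uid: "w1-1", body: "first"},
		{uid: "w1-2", body: "second"},
		{uid: "w1-2", body: "second"}, // duplicate from the retry
		{uid: "w1-3", body: "third"},
	}
	for _, r := range dedup(file) {
		fmt.Println(r.uid, r.body)
	}
}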