6.824 2022 Lecture 3: GFS

The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
SOSP 2003

Why are we reading this paper?
  distributed storage is a key abstraction
    what should the interface/semantics look like?
    how should it work internally?
  GFS paper touches on many themes of 6.824
    parallel performance, fault tolerance, replication, consistency
  good systems paper -- details from apps all the way to network
  successful real-world design
  influential: Hadoop's HDFS

Why is distributed storage hard?
  high performance -> shard data over many servers
  many servers -> constant faults
  fault tolerance -> replication
  replication -> potential inconsistencies
  better consistency -> low performance

What would we like for consistency?
  Ideal model: same behavior as a single server
    ideal server executes client operations one at a time
      even if multiple clients issued operations concurrently
    reads reflect previous writes
      even if server crashes and restarts
    all clients see the same data
  thus: suppose C1 and C2 write concurrently, and after the writes have
    completed, C3 and C4 read. what can they see?
      C1: Wx1
      C2: Wx2
      C3:         Rx?
      C4:         Rx?
    answer: either 1 or 2, but both have to see the same value.
  This is a "strong" consistency model.
  But a single server has poor fault-tolerance.

Replication for fault-tolerance makes strong consistency tricky.
  a simple but broken replication scheme:
    two replica servers, S1 and S2
    clients send writes to both, in parallel
    clients send reads to either
  in our example, C1's and C2's write messages could arrive in
    different orders at the two replicas
    if C3 reads S1, it might see x=1
    if C4 reads S2, it might see x=2
  or what if S1 receives a write, but the client crashes before
    sending the write to S2?
  that's not strong consistency!
  better consistency usually requires communication to ensure
    the replicas stay in sync -- can be slow!
  lots of tradeoffs possible between performance and consistency
    we'll see one today

GFS Context:
  Many Google services needed a big fast unified storage system
    MapReduce, crawler, indexer, log storage/analysis
  Shared among multiple applications, e.g. crawl, index, analyze
  Automatic "sharding" of each file over many servers/disks
    For parallel performance with many clients, e.g. MapReduce
    For huge files too big for one server
  Automatic recovery from failures
  Just one data center per deployment
  Just Google applications/users
  Aimed at sequential access to huge files; read or append
    I.e. not a low-latency DB for small items

What was new about this in 2003? How did they get an SOSP paper accepted?
  Not the basic ideas of distribution, sharding, fault-tolerance.
  Huge scale.
  Used in industry, real-world experience.
  Successful use of weak consistency.

Overall structure
  clients (library, RPC -- but not visible as a UNIX FS)
  coordinator tracks file names
  chunkservers store 64 MB chunks
  big files split into 64 MB chunks, on lots of chunkservers
    big chunks -> low book-keeping overhead
  each chunk replicated on 3 chunkservers
    why 3 rather than 2?

Coordinator state
  tables in RAM (for speed, must be smallish):
    file name -> array of chunk handles (nv)
    chunk handle -> version # (nv)
                    list of chunkservers (v)
                    primary (v)
                    lease time (v)
  "nv" state also written to disk, in case of crash+reboot

What are the steps when client C wants to read a file?
  (a code sketch of the coordinator tables and this read path follows these steps)
  1. C sends filename and offset to coordinator (CO) (if not cached)
  2. CO finds chunk handle for that offset
  3. CO replies with list of chunk handles + chunkservers
     only those with latest version
  4. C caches handle + chunkserver list
  5. C sends request to nearest chunkserver
     chunk handle, offset
  6. chunkserver reads from chunk file on disk, returns to client
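A minimal Go sketch of the coordinator's tables and the client's cached
read path, to make the steps above concrete. All of the names here
(Coordinator, ChunkInfo, Lookup, and so on) are invented for illustration;
the paper does not specify the client library or RPC interface at this level.

    // Sketch of coordinator metadata and client-side caching (hypothetical names).
    package gfs

    import "time"

    type ChunkHandle uint64

    // Per-chunk state. "nv" fields are also logged to disk; "v" fields are
    // volatile and rebuilt by polling chunkservers after a coordinator reboot.
    type ChunkInfo struct {
        Version      uint64    // nv: bumped each time a new primary is designated
        Servers      []string  // v: chunkservers believed to hold the latest version
        Primary      string    // v: current primary, if any
        LeaseExpires time.Time // v: when the primary's lease runs out
    }

    type Coordinator struct {
        files  map[string][]ChunkHandle // nv: file name -> chunk handles, in order
        chunks map[ChunkHandle]*ChunkInfo
    }

    // Lookup maps (file, offset) to a chunk handle plus the chunkservers that
    // hold the latest version -- steps 1-3 of the read path.
    func (co *Coordinator) Lookup(file string, offset int64) (ChunkHandle, []string) {
        handles := co.files[file]
        h := handles[offset/(64<<20)] // 64 MB chunks
        return h, co.chunks[h].Servers
    }

    // The client caches name -> (handle, servers) so repeated reads of the same
    // chunk skip the coordinator -- step 4; steps 5-6 then go to a chunkserver.
    type clientCacheEntry struct {
        handle  ChunkHandle
        servers []string
    }

    type Client struct {
        cache map[string]clientCacheEntry // keyed by file name and chunk index
    }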
Clients only ask the coordinator about file names and lists of chunk handles
  clients cache name -> chunkhandle info
  coordinator does not handle data, so (hopefully) not heavily loaded

How does the coordinator know what chunkservers have a given chunk?

What are the steps when C wants to do a "record append"?
  paper's Figure 2 (a code sketch follows these steps)
  1. C asks CO about file's last chunk
  2. CO tells C the primary and secondaries
  3. C sends data to all (just temporary...), waits for all replies (?)
  4. C tells P to append
  5. P checks that lease hasn't expired, and chunk has space
  6. P picks an offset (at end of chunk)
  7. P writes its own chunk file (a Linux file)
  8. P tells each secondary the offset, tells it to append to its chunk file
  9. P waits for all secondaries to reply, or timeout
     secondary can reply "error", e.g. out of disk space
  10. P tells C "ok" or "error"
  11. C retries from start if error
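A rough Go sketch of the primary's side of record append (steps 5-10 above).
Names and structure are assumptions, not the paper's code; the data-push phase
(step 3) and real error handling are omitted.

    // Hypothetical primary-side append logic.
    package gfs

    import (
        "errors"
        "time"
    )

    const maxChunk = 64 << 20 // 64 MB chunks

    // ReplicaClient stands in for whatever RPC mechanism reaches the secondaries.
    type ReplicaClient interface {
        ApplyAppend(offset int64, data []byte) error
    }

    type Primary struct {
        leaseExpires time.Time
        chunkLen     int64           // current length of the chunk file
        secondaries  []ReplicaClient // the other replicas of this chunk
    }

    // Append: check lease and space, pick the offset, write locally, forward to
    // the secondaries, and report "ok" or "error" back to the client.
    func (p *Primary) Append(data []byte) (int64, error) {
        if time.Now().After(p.leaseExpires) {
            return 0, errors.New("lease expired") // step 5
        }
        if p.chunkLen+int64(len(data)) > maxChunk {
            // simplified: real GFS pads the chunk and has the client
            // retry on the file's next chunk
            return 0, errors.New("chunk full")
        }
        offset := p.chunkLen       // step 6: offset = current end of chunk
        p.writeLocal(offset, data) // step 7: append to the Linux chunk file
        p.chunkLen += int64(len(data))

        // steps 8-9: tell each secondary the chosen offset; on any failure or
        // timeout the client gets "error" and retries from the start, which is
        // why readers may see duplicated or partial records.
        for _, s := range p.secondaries {
            if err := s.ApplyAppend(offset, data); err != nil {
                return 0, err // step 10: "error"
            }
        }
        return offset, nil // step 10: "ok"
    }

    func (p *Primary) writeLocal(offset int64, data []byte) {
        // sketch only: the real chunkserver appends to a file on local disk
    }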
What consistency guarantees does GFS provide to clients?
  Needs to be in a form that tells applications how to use GFS.
  Here's a possibility:
    If the primary tells a client that a record append succeeded, then any
    reader that subsequently opens the file and scans it will see the
    appended record somewhere.
  That allows lots of anomalies: different clients may see records in
    different orders; they may see duplicated records; they may see
    records from failed writes. GFS applications must be prepared!

How can we think about how GFS fulfils the guarantee?
  Look at its handling of various failures:
    crash, crash+reboot, crash+replacement, message loss, partition.
  Ask how GFS ensures the guarantee for each kind of failure.

* What if the client fails during the record append sequence?

* What if the appending client has cached a stale (wrong) primary for a chunk?

* What if the reading client has cached a stale server list for a chunk?

* Could a coordinator crash+reboot cause it to forget about the file?
  Or forget what chunkservers hold the relevant chunk?

* Two clients do record append at exactly the same time.
  Will they overwrite each other's records?

* Suppose one secondary doesn't hear the append command from the primary.
  Due to a temporary network failure.
  What if a reading client reads from that secondary?

* What if primary S1 is alive and serving client requests, but the network
  between the coordinator and S1 fails? ("network partition")
  Will the coordinator pick a new primary?
  Will there now be two primaries?
    So that the append goes to one primary, and the read to the other?
    Thus breaking the consistency guarantee?
    "split brain"

* What if the primary crashes before sending an append to all secondaries?
  Could a secondary that *didn't* see the append be chosen as the new primary?

* Chunkserver S4 with an old, stale copy of a chunk is offline.
  The primary and all live secondaries crash.
  S4 comes back to life (before the primary and secondaries).
  Will the coordinator choose S4 (with the stale chunk) as primary?
  Better to have a primary with ancient data, or no replicas at all?

* How does the coordinator set up a primary for a chunk?
  (a code sketch follows this item)
  If a client wants to write, but there is no primary, or the primary's
    lease expired and it is dead.
  The coordinator has been polling chunkservers about what chunks/versions
    they have.
  a. if no chunkservers w/ latest version #, error
  b. pick primary P and secondaries from those w/ latest version #
  c. increment version #, write to disk
  d. tell P and secondaries who they are, and the new version #
  e. replicas write new version # to disk
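A minimal Go sketch of steps a-e, reusing the Coordinator, ChunkHandle, and
ChunkInfo types from the earlier sketch. The function name and the shape of the
chunkserver reports are assumptions; the paper does not describe this at code
level. The 60-second lease length is from the paper.

    package gfs

    import (
        "errors"
        "time"
    )

    const leaseDuration = 60 * time.Second

    // assignPrimary picks a new primary for chunk h, given the version number
    // each chunkserver reported for that chunk when the coordinator polled it.
    func (co *Coordinator) assignPrimary(h ChunkHandle, reported map[string]uint64) error {
        info := co.chunks[h]

        // a. find the chunkservers holding the latest version we know of
        var upToDate []string
        for server, v := range reported {
            if v == info.Version {
                upToDate = append(upToDate, server)
            }
        }
        if len(upToDate) == 0 {
            return errors.New("no chunkserver has the latest version")
        }

        // b. pick a primary (and secondaries) from the up-to-date replicas
        primary := upToDate[0]

        // c. increment the version number and write it to disk *before* telling
        // anyone, so a crash+reboot can't treat an old version as the latest
        info.Version++
        co.persistVersion(h, info.Version)

        // d./e. tell P and the secondaries who they are and the new version
        // number (they record it on their disks too), and grant the lease
        info.Primary = primary
        info.Servers = upToDate
        info.LeaseExpires = time.Now().Add(leaseDuration)
        return nil
    }

    func (co *Coordinator) persistVersion(h ChunkHandle, v uint64) {
        // sketch only: the real coordinator logs its non-volatile state to disk
    }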
* What should a primary do if a secondary always fails writes?
  e.g. it is dead, or out of disk space, or its disk has broken.
  Keep failing client requests?
  Or ask the coordinator to declare a new set of servers and a new version?
  The paper does not describe this process.

* If there's a partitioned primary serving client appends, and its lease
  expires, and the coordinator picks a new primary, will the new primary
  have the latest data as updated by the partitioned primary?

* What if the coordinator fails altogether?
  Will the replacement know everything the dead coordinator knew?
  E.g. each chunk's version number? primary? lease expiry time?

* Who/what decides the coordinator is dead and must be replaced?
  Could the coordinator replicas ping the coordinator, and take over
  if there is no response?

* What happens if the entire building suffers a power failure?
  And then power is restored, and all servers reboot.

* Suppose the coordinator wants to create a new chunk replica.
  Maybe because there are too few replicas.
  Suppose it's the last chunk in the file, and it is being appended to.
  How does the new replica ensure it doesn't miss any appends?
  After all, it is not yet one of the secondaries.

* Is there *any* circumstance in which GFS will break the guarantee?
  i.e. an append succeeds, but subsequent readers don't see the record.
  All coordinator replicas permanently lose state (permanent disk failure).
    Could be worse: the result will be "no answer", not "incorrect data".
    "fail-stop"
  All chunkservers holding the chunk permanently lose disk content.
    again, fail-stop; not the worst possible outcome
  CPU, RAM, network, or disk yields an incorrect value.
    checksums catch some cases, but not all
  Time is not properly synchronized, so leases don't work out.
    So there are multiple primaries, and maybe the write goes to one,
    the read to the other.

What would it take to have no anomalies -- strict consistency?
  I.e. all clients see the same file content.
  Too hard to give a real answer here, but some issues:
  * All replicas should complete each write, or none.
    Perhaps tentative writes until all promise to complete it?
    Don't expose writes until all have agreed to perform them!
  * The primary should detect duplicate client write requests.
  * If the primary crashes, some replicas may be missing the last few ops.
    The new primary must talk to all replicas to find all recent ops,
    and sync with the secondaries.
  * Clients must be prevented from reading from stale ex-secondaries;
    perhaps secondaries have leases, or clients know about chunk versions
    and get a lease on that version from the coordinator.
  You'll see solutions in Labs 2 and 3!

Performance (Figure 3)
  large aggregate throughput for read
    94 MB/sec total for 16 clients
      or 6 MB/second per client
      is that good?
      one disk's sequential throughput was about 30 MB/s
      one NIC was about 10 MB/s
    close to saturating the inter-switch link's 125 MB/sec
    so: per-client performance is not huge
      but multi-client scalability is good
      which is more important?
    Table 3 reports 500 MB/sec for production GFS, which is a lot
  writes to different files: lower than the possible maximum
    authors blame their network stack (but give no detail)
  concurrent appends to a single file
    limited by the server that stores the last chunk
  hard to interpret after 15 years, e.g. how fast were the disks?

Random issues worth considering
  What would it take to support small files well?
  What would it take to support billions of files?
  Could GFS be used as a wide-area file system?
    With replicas in different cities?
    All replicas in one datacenter is not very fault tolerant!
  How long does GFS take to recover from a failure?
    Of a chunkserver? Of the coordinator?
  How well does GFS cope with slow chunkservers?

Retrospective interview with GFS engineer:
  http://queue.acm.org/detail.cfm?id=1594206
  file count was the biggest problem
    eventual numbers grew to 1000x those in Table 2!
    hard to fit in coordinator RAM
    coordinator scanning of all files/chunks for GC is slow
  1000s of clients
    too much CPU load on the coordinator
  applications had to be designed to cope with GFS semantics and limitations
  coordinator fail-over was initially entirely manual, taking 10s of minutes
  BigTable is one answer to the many-small-files problem
    and Colossus apparently shards coordinator data over many coordinators

Summary
  case study of performance, fault-tolerance, consistency
    specialized for MapReduce applications
  good ideas:
    global cluster file system as universal infrastructure
    separation of naming (coordinator) from storage (chunkserver)
    sharding for parallel throughput
    huge files/chunks to reduce overheads
    primary to sequence writes
    leases to prevent split-brain chunkserver primaries
  not so great:
    single coordinator performance
      ran out of RAM and CPU
    chunkservers not very efficient for small files
    lack of automatic fail-over to a coordinator replica
    maybe consistency was too relaxed