GFS FAQ

Q: Why is atomic record append at-least-once, rather than exactly
once?

A: It is difficult to make append exactly once, because a primary
would then need to keep state to perform duplicate detection. That
state must be replicated across servers so that if the primary fails,
this information isn't lost. You will implement exactly-once appends
in lab 3, but with more complicated protocols than GFS uses. (A
sketch of the duplicate-detection state this requires appears after
these questions.)

Q: How does an application know what sections of a chunk consist of
padding and duplicate records?

A: To detect padding, applications can put a predictable magic number
at the start of a valid record, or include a checksum that will
likely only be valid if the record is valid. The application can
detect duplicates by including unique IDs in records. Then, if it
reads a record that has the same ID as an earlier record, it knows
that they are duplicates of each other. GFS provides a library for
applications that handles these cases. (See the record-framing sketch
below.)

Q: How can clients find their data given that atomic record append
writes it at an unpredictable offset in the file?

A: Append (and GFS in general) is mostly intended for applications
that read entire files. Such applications will look for every record
(see the previous question), so they don't need to know the record
locations in advance. For example, the file might contain the set of
link URLs encountered by a set of concurrent web crawlers. The file
offset of any given URL doesn't matter much; readers just want to be
able to read the entire set of URLs.

Q: The paper mentions reference counts -- what are they?

A: They are part of the implementation of copy-on-write for
snapshots. When GFS creates a snapshot, it doesn't copy the chunks,
but instead increases the reference counter of each chunk. This makes
creating a snapshot inexpensive. If a client writes a chunk and the
master notices that the reference count is greater than one, the
master first makes a copy so that the client can update the copy
(instead of the chunk that is part of the snapshot). You can view
this as delaying the copy until it is absolutely necessary. The hope
is that not all chunks will be modified, so some copies can be
avoided entirely. (See the copy-on-write sketch below.)

Q: If an application uses the standard POSIX file APIs, would it need
to be modified in order to use GFS?

A: Yes, but GFS isn't intended for existing applications. It is
designed for newly-written applications, such as MapReduce programs.

Q: How does GFS determine the location of the nearest replica?

A: The paper hints that GFS does this based on the IP addresses of
the servers storing the available replicas. In 2003, Google must have
assigned IP addresses in such a way that if two IP addresses are
close to each other in IP address space, then the corresponding
machines are also close together in the machine room. (The last
sketch below illustrates this heuristic.)

Q: Does Google still use GFS?

A: GFS is still in use by Google and is the backend of other storage
systems, such as BigTable. GFS's design has doubtless been adjusted
over the years, since workloads have become larger and technology has
changed, but I don't know the details. HDFS is an open-source
implementation of GFS's design, and is used by many companies.

Q: Won't the master be a performance bottleneck?

A: It certainly has that potential, and the GFS designers took care
to avoid this problem. For example, the master keeps its state in
memory so that it can respond quickly. The evaluation indicates that
for large file reads (the workload GFS targets), the master is not a
bottleneck. For small file operations or directory operations, the
master keeps up as well (see Section 6.2.4).
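To make the first answer concrete, here is a minimal sketch (in Go)
of the duplicate-detection state a primary would need for
exactly-once appends. The names (DupTable, Result) and the
client-ID/sequence-number scheme are hypothetical illustrations, not
GFS's actual types; the point is only that such a table must exist
and must survive primary failure.

package main

import "fmt"

// Result of an append as seen by the client: the offset where the
// record landed. Hypothetical; a real RPC reply would carry more.
type Result struct{ Offset int64 }

type entry struct {
	seq int64
	res Result
}

// DupTable remembers the most recent request from each client so
// that a retransmitted append can be answered without re-executing
// it. To survive primary failure, this table would have to be
// replicated along with the data -- the cost GFS avoids by settling
// for at-least-once.
type DupTable struct{ last map[int64]entry }

func NewDupTable() *DupTable {
	return &DupTable{last: make(map[int64]entry)}
}

// Apply runs op only if (client, seq) is new; a duplicate request
// gets the saved result of the earlier execution instead.
func (t *DupTable) Apply(client, seq int64, op func() Result) Result {
	if e, ok := t.last[client]; ok && e.seq == seq {
		return e.res
	}
	res := op()
	t.last[client] = entry{seq, res}
	return res
}

func main() {
	t := NewDupTable()
	off := int64(0)
	appendOp := func() Result { r := Result{off}; off += 100; return r }
	fmt.Println(t.Apply(1, 7, appendOp)) // first attempt: appends at offset 0
	fmt.Println(t.Apply(1, 7, appendOp)) // retry: same answer, no second append
}

Lab 3 layers a table like this on top of a replicated log, which is
exactly the replication expense GFS chose to avoid.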
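Here is a minimal sketch of the record framing described above: a
magic number to find record boundaries, a checksum to reject padding
and corrupted data, and a unique ID to drop duplicates. The layout (a
made-up 20-byte header) is invented for illustration; the paper does
not describe the actual format used by GFS's application library.

package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

const magic uint32 = 0x5D17A600 // hypothetical marker for a valid record

// Frame wraps a payload with magic number, record ID, length, and
// CRC, so a reader can skip padding and detect duplicates.
func Frame(id uint64, payload []byte) []byte {
	var b bytes.Buffer
	binary.Write(&b, binary.LittleEndian, magic)
	binary.Write(&b, binary.LittleEndian, id)
	binary.Write(&b, binary.LittleEndian, uint32(len(payload)))
	binary.Write(&b, binary.LittleEndian, crc32.ChecksumIEEE(payload))
	b.Write(payload)
	return b.Bytes()
}

// ReadAll scans a chunk, skipping bytes that don't start a valid
// record and returning each record ID's payload exactly once.
func ReadAll(chunk []byte) map[uint64][]byte {
	out := make(map[uint64][]byte)
	for i := 0; i+20 <= len(chunk); {
		if binary.LittleEndian.Uint32(chunk[i:]) != magic {
			i++ // padding or garbage: resynchronize byte by byte
			continue
		}
		id := binary.LittleEndian.Uint64(chunk[i+4:])
		n := int(binary.LittleEndian.Uint32(chunk[i+12:]))
		sum := binary.LittleEndian.Uint32(chunk[i+16:])
		if i+20+n > len(chunk) || crc32.ChecksumIEEE(chunk[i+20:i+20+n]) != sum {
			i++ // looked like a header but wasn't a valid record
			continue
		}
		if _, dup := out[id]; !dup { // same ID seen before: a duplicate append
			out[id] = chunk[i+20 : i+20+n]
		}
		i += 20 + n
	}
	return out
}

func main() {
	chunk := append(Frame(1, []byte("http://a")), make([]byte, 13)...) // padding
	chunk = append(chunk, Frame(1, []byte("http://a"))...)             // duplicate
	chunk = append(chunk, Frame(2, []byte("http://b"))...)
	for id, p := range ReadAll(chunk) {
		fmt.Println(id, string(p))
	}
}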
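The copy-on-write bookkeeping behind reference counts can be sketched
in a few lines. This toy master is an illustration of the refcounting
idea under invented names, not GFS's real data structures; in real
GFS the chunk data is copied by chunkservers, and the master merely
coordinates.

package main

import "fmt"

type Master struct {
	refs   map[int]int      // chunk handle -> reference count
	files  map[string][]int // file name -> chunk handles
	nextID int
}

func NewMaster() *Master {
	return &Master{refs: make(map[int]int), files: make(map[string][]int)}
}

func (m *Master) Create(name string, nchunks int) {
	for i := 0; i < nchunks; i++ {
		m.files[name] = append(m.files[name], m.nextID)
		m.refs[m.nextID] = 1
		m.nextID++
	}
}

// Snapshot copies only the file's chunk list and bumps refcounts; no
// chunk data moves, which is what makes snapshots cheap.
func (m *Master) Snapshot(src, dst string) {
	handles := append([]int(nil), m.files[src]...)
	m.files[dst] = handles
	for _, h := range handles {
		m.refs[h]++
	}
}

// Write is called before a client mutates chunk i of a file. If the
// chunk is shared (refcount > 1), it is cloned first so the
// snapshot's copy stays intact: the copy is delayed until needed.
func (m *Master) Write(name string, i int) int {
	h := m.files[name][i]
	if m.refs[h] > 1 {
		m.refs[h]--
		clone := m.nextID
		m.nextID++
		m.refs[clone] = 1
		m.files[name][i] = clone // in real GFS, chunkservers copy the data
		return clone
	}
	return h
}

func main() {
	m := NewMaster()
	m.Create("/logs/a", 2)
	m.Snapshot("/logs/a", "/snap/a")
	fmt.Println(m.Write("/logs/a", 0)) // shared -> cloned to a new handle
	fmt.Println(m.Write("/logs/a", 0)) // now unshared -> same handle
}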
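And here is one guess at the nearest-replica heuristic: pick the
replica whose IP address shares the longest common prefix with the
client's. The paper only hints that IP addresses are used, so treat
this as an illustration of the idea, not Google's actual metric.

package main

import (
	"fmt"
	"math/bits"
	"net"
)

// ip4 packs an IPv4 address into a uint32 so that addresses can be
// compared bit by bit.
func ip4(ip net.IP) uint32 {
	v := ip.To4()
	return uint32(v[0])<<24 | uint32(v[1])<<16 | uint32(v[2])<<8 | uint32(v[3])
}

// closest returns the replica sharing the longest address prefix
// with the client: fewer differing high-order bits = "nearer".
func closest(client net.IP, replicas []net.IP) net.IP {
	c := ip4(client)
	best, bestLen := replicas[0], -1
	for _, r := range replicas {
		if l := bits.LeadingZeros32(c ^ ip4(r)); l > bestLen {
			best, bestLen = r, l
		}
	}
	return best
}

func main() {
	client := net.ParseIP("10.1.2.3")
	replicas := []net.IP{
		net.ParseIP("10.9.0.1"),
		net.ParseIP("10.1.9.9"),
		net.ParseIP("10.1.2.77"),
	}
	fmt.Println(closest(client, replicas)) // 10.1.2.77: longest shared prefix
}

This only helps if address assignment mirrors physical layout, which
is the assumption attributed to Google's 2003 machine rooms above.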
Q: How acceptable is it that GFS trades correctness for performance
and simplicity?

A: This is a recurring theme in distributed systems. Strong
consistency usually requires protocols that are complex and require
chit-chat between machines (as we will see in the next few lectures).
By exploiting the ways in which specific classes of applications can
tolerate relaxed consistency, one can design systems that have good
performance and sufficient consistency. For example, GFS optimizes
for MapReduce applications, which need high read performance for
large files and can tolerate holes in files, records showing up
several times, and inconsistent reads. On the other hand, GFS would
not be a good fit for storing account balances at a bank.

Q: What if the master fails?

A: There are replica masters with a full copy of the master state; an
unspecified mechanism switches to one of the replicas if the current
master fails (Section 5.1.3). It's possible that a human has to
intervene to designate the new master. At any rate, there is almost
certainly a single point of failure lurking here that would in theory
prevent automatic recovery from master failure. (The sketch below
shows why naive automatic failover is risky.) We will see in later
lectures how to build a fault-tolerant master using Raft.
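To see why the paper's "unspecified mechanism" is hard to make safely
automatic, consider this deliberately naive failover monitor (all
names hypothetical). A slow-but-alive master also stops answering
pings, so a design like this can end up with two active masters
(split brain); avoiding that requires consensus, which is where Raft
comes in.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// monitor pings the master and promotes a replica after several
// consecutive missed pings. It cannot tell a dead master from a
// slow one, which is the fundamental hazard.
func monitor(ping func() bool, promote func()) {
	misses := 0
	for range time.Tick(100 * time.Millisecond) {
		if ping() {
			misses = 0
			continue
		}
		if misses++; misses >= 3 {
			promote() // unsafe: the old master may still be serving clients
			return
		}
	}
}

func main() {
	var alive atomic.Bool
	alive.Store(true)
	go func() { // the master "crashes" a moment later
		time.Sleep(250 * time.Millisecond)
		alive.Store(false)
	}()
	monitor(alive.Load, func() { fmt.Println("promoting replica master") })
}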