GFS FAQ

Q: Why is atomic record append at-least-once, rather than exactly once?

A: Section 3.1, Step 7, says that if a write fails at one of the secondaries, the client re-tries the write. That will cause the data to be appended more than once at the non-failed replicas. A different design could probably detect duplicate client requests despite arbitrary failures (e.g. a primary failure between the original request and the client's retry). You'll implement such a design in Lab 3, at considerable expense in complexity and performance.

Q: How does an application know what sections of a chunk consist of padding and duplicate records?

A: To detect padding, applications can put a predictable magic number at the start of a valid record, or include a checksum that will likely only be valid if the record is valid. The application can detect duplicates by including unique IDs in records. Then, if it reads a record that has the same ID as an earlier record, it knows that they are duplicates of each other. GFS provides a library for applications that handles these cases. (A small sketch of such a record format appears at the end of this group of questions.)

Q: How can clients find their data given that atomic record append writes it at an unpredictable offset in the file?

A: Append (and GFS in general) is mostly intended for applications that read entire files sequentially. Such applications will scan the file looking for valid records (see the previous question), so they don't need to know the record locations in advance. For example, the file might contain the set of link URLs encountered by a set of concurrent web crawlers. The file offset of any given URL doesn't matter much; readers just want to be able to read the entire set of URLs.

Q: What's a checksum?

A: A checksum algorithm takes a block of bytes as input and returns a single number that's a function of all the input bytes. For example, a simple checksum might be the sum of all the bytes in the input (mod some big number). GFS stores the checksum of each chunk as well as the chunk itself. When a chunkserver writes a chunk to its disk, it first computes the checksum of the new chunk and saves the checksum on disk alongside the chunk. When a chunkserver reads a chunk from disk, it also reads the previously-saved checksum, re-computes a checksum from the chunk read from disk, and checks that the two checksums match. If the data was corrupted on disk, the checksums won't match, and the chunkserver will know to return an error. Separately, some GFS applications store their own checksums, over application-defined records, inside GFS files, to distinguish correct records from padding. CRC32 is an example of a checksum algorithm.

Q: The paper mentions reference counts -- what are they?

A: They are part of the implementation of copy-on-write for snapshots. When GFS creates a snapshot, it doesn't copy the chunks; instead it increases the reference count of each chunk. This makes creating a snapshot inexpensive. If a client writes a chunk and the master notices that the reference count is greater than one, the master first makes a copy so that the client can update the copy (instead of the chunk that is part of the snapshot). You can view this as delaying the copy until it is absolutely necessary. The hope is that not all chunks will be modified, so some copies are never made.

Q: If an application uses the standard POSIX file APIs, would it need to be modified in order to use GFS?

A: Yes, but GFS isn't intended for existing applications. It is designed for newly-written applications, such as MapReduce programs.
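The padding/duplicate and checksum answers above describe what an application-level record library has to do. Below is a minimal sketch in Go of one way to do it; it is not the library the paper alludes to, and the header layout (magic number, length, unique ID, CRC32) and the names writeRecord/scanRecords are just assumptions made for illustration.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
        "io"
    )

    const magic uint32 = 0x47465331 // arbitrary marker for the start of a valid record

    // writeRecord appends one framed record: magic, payload length, unique ID, CRC32, payload.
    func writeRecord(w io.Writer, id uint64, payload []byte) error {
        hdr := make([]byte, 20)
        binary.LittleEndian.PutUint32(hdr[0:4], magic)
        binary.LittleEndian.PutUint32(hdr[4:8], uint32(len(payload)))
        binary.LittleEndian.PutUint64(hdr[8:16], id)
        binary.LittleEndian.PutUint32(hdr[16:20], crc32.ChecksumIEEE(payload))
        if _, err := w.Write(hdr); err != nil {
            return err
        }
        _, err := w.Write(payload)
        return err
    }

    // scanRecords reads a whole file image, skipping padding (bytes that don't
    // start with the magic number, or whose checksum doesn't match) and dropping
    // records whose unique ID has already been seen.
    func scanRecords(data []byte) [][]byte {
        seen := make(map[uint64]bool)
        var out [][]byte
        for i := 0; i+20 <= len(data); {
            if binary.LittleEndian.Uint32(data[i:i+4]) != magic {
                i++ // padding or garbage: slide forward one byte
                continue
            }
            n := int(binary.LittleEndian.Uint32(data[i+4 : i+8]))
            id := binary.LittleEndian.Uint64(data[i+8 : i+16])
            sum := binary.LittleEndian.Uint32(data[i+16 : i+20])
            if i+20+n > len(data) || crc32.ChecksumIEEE(data[i+20:i+20+n]) != sum {
                i++ // looked like a header but the record isn't valid
                continue
            }
            if !seen[id] {
                seen[id] = true
                out = append(out, data[i+20:i+20+n])
            }
            i += 20 + n
        }
        return out
    }

    func main() {
        var buf bytes.Buffer
        writeRecord(&buf, 1, []byte("http://a.example"))
        buf.Write(make([]byte, 37)) // padding left over from a failed append
        writeRecord(&buf, 2, []byte("http://b.example"))
        writeRecord(&buf, 2, []byte("http://b.example")) // duplicate caused by a client retry
        for _, rec := range scanRecords(buf.Bytes()) {
            fmt.Println(string(rec))
        }
    }

In this toy run the reader prints each URL exactly once: the zero padding never matches the magic number, and the retried append is dropped because its ID has already been seen.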
Q: How does GFS determine the location of the nearest replica?

A: The paper hints that GFS does this based on the IP addresses of the servers storing the available replicas. In 2003, Google must have assigned IP addresses in such a way that if two IP addresses are close to each other in IP address space, then they are also close together in the machine room.

Q: Suppose S1 is the primary for a chunk, and the network between the master and S1 fails. The master will notice and designate some other server as primary, say S2. Since S1 didn't actually fail, are there now two primaries for the same chunk?

A: That would be a disaster, since both primaries might apply different updates to the same chunk. Luckily GFS's lease mechanism prevents this scenario. The master granted S1 a 60-second lease to be primary. S1 knows to stop being primary when its lease expires. The master won't grant a lease to S2 until the previous lease to S1 expires. So S2 won't start acting as primary until after S1 stops. (A small sketch of this timing argument appears a few questions further down.)

Q: 64 megabytes sounds awkwardly large for the chunk size!

A: The 64 MB chunk size is the unit of book-keeping in the master, and the granularity at which files are sharded over chunkservers. Clients could issue smaller reads and writes -- they were not forced to deal in whole 64 MB chunks. The point of using such a big chunk size is to reduce the size of the metadata tables in the master, and to reduce per-request overhead for clients that want to do huge transfers. On the other hand, files smaller than 64 MB do not get much parallelism.

Q: Does Google still use GFS?

A: Rumor has it that GFS has been replaced by something called Colossus, with the same overall goals, but with improvements in master performance and fault-tolerance.

Q: How acceptable is it that GFS trades correctness for performance and simplicity?

A: This is a recurring theme in distributed systems. Strong consistency usually requires protocols that are complex and require chit-chat between machines (as we will see in the next few lectures). By exploiting ways that specific classes of applications can tolerate relaxed consistency, one can design systems that have good performance and sufficient consistency. For example, GFS optimizes for MapReduce applications, which need high read performance for large files and are OK with having holes in files, records showing up several times, and inconsistent reads. On the other hand, GFS would not be good for storing account balances at a bank.

Q: What if the master fails?

A: There are replica masters with a full copy of the master state; the paper's design requires human intervention to switch to one of the replicas after a master failure (Section 5.1.3). We will see later how to build replicated services with automatic cut-over to a backup, using Raft.

Q: Why 3 replicas?

A: This number was presumably chosen to minimize the probability that a chunk will be lost; I imagine it was calculated based on the reliability of disks and the time it takes to create a new replica. Here is a study on disk reliability from that era: https://research.google.com/archive/disk_failures.pdf. If clients want a file to be replicated more than 3 times (e.g., to allow for more concurrent reads), they can specify that.
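Here is a minimal sketch, in Go, of the lease timing argument from the "two primaries" question above. It is not GFS's real protocol: it ignores lease extension and clock drift, pretends the master and the chunkservers read the same clock, and the master/primary types and their methods are assumptions made for illustration.

    package main

    import (
        "fmt"
        "time"
    )

    const leaseDuration = 60 * time.Second

    // master tracks who currently holds the primary lease for one chunk.
    type master struct {
        primary     string
        leaseExpiry time.Time
    }

    // grant hands the lease to a new primary, but only once the previous
    // lease has expired; until then the caller has to wait and retry.
    func (m *master) grant(server string, now time.Time) bool {
        if now.Before(m.leaseExpiry) {
            return false // the old primary may still be acting on its lease
        }
        m.primary = server
        m.leaseExpiry = now.Add(leaseDuration)
        return true
    }

    // primary remembers when its own lease runs out and refuses writes after that.
    type primary struct {
        leaseExpiry time.Time
    }

    func (p *primary) applyMutation(now time.Time) bool {
        return now.Before(p.leaseExpiry)
    }

    func main() {
        t0 := time.Now()
        m := &master{}
        m.grant("S1", t0)
        s1 := &primary{leaseExpiry: t0.Add(leaseDuration)}

        // The master loses contact with S1 at t0+30s and tries to promote S2.
        fmt.Println("grant to S2 at t0+30s:", m.grant("S2", t0.Add(30*time.Second)))  // false: S1's lease is live
        fmt.Println("grant to S2 at t0+61s:", m.grant("S2", t0.Add(61*time.Second)))  // true: S1's lease has expired
        fmt.Println("S1 writes at t0+61s:", s1.applyMutation(t0.Add(61*time.Second))) // false: S1 has stopped
    }

The invariant the sketch illustrates: the master refuses to re-grant the lease until the previous 60-second lease has run out, and by then S1 has already stopped accepting mutations, so at no point do two servers act as primary for the same chunk.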
Q: Did having a single master turn out to be a good idea?

A: That idea simplified initial deployment but was not so great in the long run. This article -- https://queue.acm.org/detail.cfm?id=1594206 -- says that as the years went by and GFS use grew, a few things went wrong. The number of files grew enough that it wasn't reasonable to store all files' metadata in the RAM of a single master. The number of clients grew enough that a single master didn't have enough CPU power to serve them. And the fact that switching from a failed master to one of its backups required human intervention made recovery slow. Apparently Google's replacement for GFS, Colossus, splits the master over multiple servers, and has more automated master failure recovery.

Q: What is internal fragmentation? Why does lazy allocation help?

A: Internal fragmentation is the space wasted when a system uses an allocation unit larger than what the request actually needs. For example, the risk in GFS is that an application creates a 1-byte file and 64 MB minus one byte is wasted, because the allocation unit is a 64 MB chunk. GFS avoids this potential problem because chunk space is lazily allocated: each chunk is stored as an ordinary Linux file, so when an application creates a 1-byte file, the on-disk representation of the chunk is a 1-byte Linux file.
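As a hedged, concrete illustration of lazy allocation: the little Go program below (Linux/Unix-specific, since it reads syscall.Stat_t; not GFS code) writes a 1-byte "chunk" file and asks the filesystem how much disk space the file really occupies. The file name and the framing as a GFS chunk are made up; the point is only that an ordinary Linux file consumes disk blocks in proportion to the data written, not to the 64 MB chunk size.

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        // Pretend this Linux file holds one GFS chunk whose application file is 1 byte long.
        name := "/tmp/chunk-demo"
        f, err := os.Create(name)
        if err != nil {
            panic(err)
        }
        defer os.Remove(name)
        f.Write([]byte{'x'}) // the application's single byte
        f.Close()

        info, err := os.Stat(name)
        if err != nil {
            panic(err)
        }
        st := info.Sys().(*syscall.Stat_t)
        // st.Blocks counts 512-byte blocks actually allocated on disk: typically
        // one filesystem block (a few KB), nowhere near the 64 MB chunk size.
        fmt.Printf("logical size: %d byte(s), on disk: %d KB\n",
            info.Size(), st.Blocks*512/1024)
    }

On a typical filesystem this reports a logical size of 1 byte and an on-disk size of a few KB (one filesystem block), not 64 MB.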