ZooKeeper FAQ Q: What's the main take-away from this paper? A: ZooKeeper's main academic contribution lies in the detailed design of a storage system specialized to fault-tolerant high-performance configuration management: watches, sessions, the choice of consistency, the specific semantics of the operations. It builds on existing work such as Chubby and Paxos. A lot of what's interesting for us in 6.824 is the idea that one can obtain fault tolerance by keeping the critical state in fault-tolerant storage (ZooKeeper), and running the computation in non-fault-tolerant servers. For example, a MapReduce master might keep state about jobs, task status, workers, location of intermediate output, &c, in ZooKeeper. If the master fails, a new computer can be selected to run the MapReduce master software, and load its state from ZooKeeper. This provides fault-tolerance for the master without the complexity of state-machine replication (e.g. without having to write the MapReduce master using a Raft library). Maybe you can think of this as providing fault-tolerance by making the state alone fault-tolerant, whereas use of Raft makes the entire computation fault-tolerant. This general pattern isn't new -- for example it's how database-backed web sites work -- but ZooKeeper is a good fit if your main concern is managing fault-tolerant services. Q: Why are only update requests A-linearizable? Why not reads as well? A: The authors want high total read throughput, so they want ZooKeeper replicas to be able to satisfy client reads without involving the leader. A given replica may not know about a committed write (if it's not in the majority that the leader waited for), or may know about a write but not yet know if it is committed. Thus a replica's state may lag behind the leader and other replicas. Thus serving reads from replicas can yield data that doesn't reflect recent writes -- that is, reads can return stale results. Q: How does linearizability differ from serializability? A: The usual definition of serializability is much like linearizability, but without the requirement that operations respect real-time ordering. Have a look at this explanation: http://www.bailis.org/blog/linearizability-versus-serializability/ Section 2.3 of the ZooKeeper paper uses "serializable" to indicate that the system behaves as if writes (from all clients combined) were executed one by one in some order. The "FIFO client order" property means that reads occur at specific points in the order of writes, and that a given client's successive reads never move backwards in that order. One thing that's going on here is that the guarantees for writes and reads are different. Q: What is pipelining? A: There are two things going on here. First, the ZooKeeper leader (really the leader's Zab layer) batches together multiple client operations in order to send them efficiently over the network, and in order to efficiently write them to disk. For both network and disk, it's often far more efficient to send a batch of N small items all at once than it is to send or write them one at a time. This kind of batching is only effective if the leader sees many client requests at the same time; so it depends on there being lots of active clients. The second aspect of pipelining is that ZooKeeper makes it easy for each client to keep many write requests outstanding at a time, by supporting asynchronous operations. From the client's point of view, it can send lots of write requests without having to wait for the responses (which arrive later, as notifications after the writes commit). From the leader's point of view, that client behavior gives the leader lots of requests to accumulate into big efficient batches. A worry with pipelining is that operations that are in flight might be re-ordered, which would cause the problem that the authors to talk about in 2.3. If a the leader has many write operations in flight followed by write to ready, you don't want those operations to be re-ordered, because then other clients may observe ready before the preceding writes have been applied. To ensure that this cannot happen, Zookeeper guarantees FIFO for client operations; that is the client operations are applied in the order they have been issued. Q: What does wait-free mean? A: The precise definition: A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf Zookeeper is wait-free because it processes one client's requests without needing to wait for other clients to take action. This is partially a consequence of the API: despite being designed to support client/client coordination and synchronization, no ZooKeeper API call is defined in a way that would require one client to wait for another. In contrast, a system that supported a lock acquire operation that waited for the current lock holder to release the lock would not be wait-free. Ultimately, however, ZooKeeper clients often need to wait for each other, and ZooKeeper does provide a waiting mechanism -- watches. The main effect of wait-freedom on the API is that watches are factored out from other operations. The combination of atomic test-and-set updates (e.g. file creation and writes condition on version) with watches allows clients to synthesize more complex blocking abstractions (e.g. Section 2.4's locks and barriers). Q: How does the leader know the order in which a client wants a bunch of asynchronous updates to be performed? A: The paper doesn't say. The answer is likely to involve the client numbering its asynchronous requests, and the leader tracking for each client (really session) what number it should next expect. The leader has to keep state per session anyway (for client session timeouts), so it might be little extra work to track a per-session request sequence number. This information would have to be preserved when a leader fails and another server takes over, so the client sequence numbers are likely passed along in replicated log entries. Q: What does a client do if it doesn't get a reply for a request? Does it re-send, in case the network lost a request or reply, or the leader crashed before committing? How does ZooKeeper avoid re-sends leading to duplicate executions? A: The paper doesn't say how all this works. Probably the leader tracks what request numbers from each session it has received and committed, so that it can filter out duplicate requests. Lab 3 has a similar arrangement. Q: If a client submits an asynchronous write, and immediately afterwards does a read, will the read see the effect of the write? A: The paper doesn't explicitly say, but the implication of the "FIFO client order" property of Section 2.3 is that the read will see the write. That seems to imply that a ZK follower may block a read until the follower has received (from the ZK leader) all of the client's preceding writes. ZooKeeper may manage this by having the client send, in its read request, the zxid of the latest preceding operation that the client submitted. Q: What is the reason for implementing 'fuzzy snapshots'? A: A precise snapshot would correspond to a specific point in the log: the snapshot would include every write before that point, and no writes after that point; and it would be clear exactly where to start replay of log entries after a reboot to bring the snapshot up to date. However, creation of a precise snapshot requires a way to prevent any writes from happening while the snapshot is being created and written to disk. Blocking writes for the duration of snapshot creation might decrease performance a lot. The point of ZooKeeper's fuzzy snapshots is that ZooKeeper creates the snapshot from its in-memory database while allowing writes to the database. This means that a snapshot does not correspond to a particular point in the log -- a snapshot includes a more or less random subset of the writes that were concurrent with snapshot creation. After reboot, ZooKeeper constructs a consistent snapshot by replaying all log entries from the point at which the snapshot started. Because updates in Zookeeper are idempotent and delivered in the same order, the application-state will be correct after reboot and replay---some messages may be applied twice (once to the state before recovery and once after recovery) but that is OK, because they are idempotent. The replay fixes the fuzzy snapshot to be a consistent snapshot of the application state. The Zookeeper leader turns the operations in the client API into idempotent transactions. For example, if a client issues a conditional setData and the version number in the request matches, the Zookeeper leader creates a setDataTXN that contains the new data, the new version number, and updated time stamps. This transaction (TXN) is idempotent: Zookeeper can execute it twice and it will result in the same state. Q: How does ZooKeeper choose leaders? A: Zookeeper uses ZAB, an atomic broadcast system, which has leader election built in, much like Raft. Here's a paper about Zab: http://dl.acm.org/citation.cfm?id=2056409 Q: How does Zookeeper's performance compare to other systems such as Paxos? A: It has impressive performance (in particular throughput); Zookeeper would beat the pants of your implementation of Raft. 3 zookeeper servers process 21,000 writes per second. Your raft with 3 servers commits on the order of tens of operations per second (assuming a magnetic disk for storage) and maybe hundreds per second with SSDs. Q: How does the ordering guarantee solve the race conditions in Section 2.3? If a client issues many write operations to various z-nodes, and then the write to Ready, then Zookeeper will guarantee that all the writes will be applied to the z-nodes before the write to Ready. Further, all clients' reads will observe the changes in the same order. Thus, if a client observes Ready, then all the preceding writes must have been applied and thus it is OK for the client to read the info in the z-nodes. Q: How big is the ZooKeeper database? It seems like the server must have a lot of memory. It depends on the application, and, unfortunately, the paper doesn't report the authors' experience in this area. Since Zookeeper is intended for configuration and coordination, and not as a general-purpose data store, an in-memory database seems reasonable. For example, you could imagine using Zookeeper for GFS's master and that amount of data should fit in the memory of a well-equipped server, as it did for GFS. Q: What's a universal object? A: It is a theoretical statement of how good the API of Zookeeper is based on a theory of concurrent objects that Herlihy introduced: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf. We won't spend any time on this statement and theory, but if you care there is a gentle introduction on this Wikipedia page: https://en.wikipedia.org/wiki/Non-blocking_algorithm. The authors appeal to this concurrent-object theory in order to show that Zookeeper's API is general-purpose: that the API includes enough features to implement any coordination scheme you'd want. Q: When might one use a double barrier? A: Here's a use for a single barrier: suppose you have a big computation that proceeds in two phases. All of the results for phase 1 must be available before phase 2 can begin. Suppose you run each phase in parallel on many workers (like MapReduce's Map, and then Reduce). A barrier can be used to enforce "all workers must finish phase 1 before any can start phase 2". If you have many phases, or a loop in which each iteration is a phase, you might want a double barrier. Q: How does a client know when to leave one of the paper's double barriers? A: Leaving the barrier involves each client watching the znodes for all other clients participating in the barrier. Each client waits for all of these znodes to disappear. Once they are all gone, the clients all leave the barrier and continue computing. Q: Is it possible to add more servers into an existing ZooKeeper without taking the service down for a period of time? It is -- although when the original paper was published, cluster membership was static. Nowadays, ZooKeeper supports "dynamic reconfiguration": https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html ... and there is a paper describing the mechanism: https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf How do you think this compares to Raft's dynamic configuration change via overlapping consensus, which appeared two years later? Q: How are watches implemented in the client library? It depends on the implementation. In most cases, the client library probably registers a callback function that will be invoked when the watch triggers. For example, a Go client for ZooKeeper implements it by passing a channel into "GetW()" (get with watch); when the watch triggers, an "Event" structure is sent through the channel. The application can check the channel in a select clause. See https://godoc.org/github.com/samuel/go-zookeeper/zk#Conn.GetW. Q: Why does the read lock on page 6 in the code jump to line 3 instead of line 2 like the write lock does? A: Good catch; I believe it is a bug. The correct recipe is here: https://zookeeper.apache.org/doc/r3.1.2/recipes.html#Shared+Locks Q: Section 5.2 says the request latency for a write is 1.2 milliseconds. Isn't that less than the time required for a single hard disk write? A: I don't know what's going on here. Section 4 says each new log entry is forced to disk, and seems to imply that ZooKeeper waits for the data to be on disk before proceeding. The wait seems necessary for crash recovery -- it would be bad to tell a client that a write had succeeded, only to forget about it in a power failure. How long might a disk write take? Section 5 says they use mechanical hard drives, and appending an entry to a log file on disk takes at least half a rotation on average. A typical disk spins at 7200 RPM, which means a half rotation takes about 4 milliseconds, which is longer than the reported overall latency. If the paper were published today, and said it used SSDs instead of hard drives, the 1.2 millisecond latency would make more sense, since one can write an SSD in much less than a millisecond. Another way to get fast writes is to interpose a battery-backed cache between the software and the disk -- the point of the battery is to preserve recent writes despite power failures, so they can be written to disk after the power is restored. Q: Why did the authors choose the name ZooKeeper? A: The apache zookeeper web site says this: "ZooKeeper: Because Coordinating Distributed Systems is a Zoo"