ZooKeeper FAQ

Q: Why are only update requests A-linearizable?

A: Because the authors want reads to scale with the number of servers, they want a read to execute at a single server without any interaction with other servers. This comes at the cost of consistency: reads are allowed to return stale data.

Q: How does linearizability differ from serializability?

A: Serializability is a correctness condition that is typically used for systems that provide transactions, that is, systems that support grouping multiple operations into one atomic operation. Linearizability is typically used for systems without transactions. When the Zookeeper paper refers to "serializable" in its definition of linearizability, it just means a serial order. We talk about serializability in subsequent papers. Here is a blog post that explains the difference, if you are curious: http://www.bailis.org/blog/linearizability-versus-serializability/

Although the blog post gives precise definitions, designers are not that precise when they use those terms to describe their systems, so often you have to glean from the context what correctness condition the designers are shooting for.

Q: What is pipelining?

A: Zookeeper "pipelines" the operations in the client API (create, delete, exists, etc.). Pipelining here means that clients execute these operations asynchronously. The client calls create, delete, etc., which sends the operation to Zookeeper and then returns. At some later point, Zookeeper invokes a callback on the client that tells the client the operation has completed and what the result is. This asynchronous interface allows a client to pipeline many operations: after it issues the first, it can immediately issue a second, and so on. This allows the client to achieve high throughput; it doesn't have to wait for each operation to complete before starting the next one.

A worry with pipelining is that operations in flight might be re-ordered, which would cause the problem the authors talk about in 2.3. If the leader has many write operations in flight followed by a write to ready, you don't want those operations to be re-ordered, because then other clients might observe ready before the preceding writes have been applied. To ensure that this cannot happen, Zookeeper guarantees FIFO order for client operations; that is, a client's operations are applied in the order in which the client issued them.

Q: What about Zookeeper's use case makes wait-free better than locking?

A: I think you mean "blocking" -- locking (as in, using locks) and blocking (as in, waiting for a request to return before issuing another one) are very different concepts. Many RPC APIs are blocking: consider, for example, clients in Lab 2/3 -- they only ever issue one request, patiently wait for it to either return or time out, and only then send the next one. This makes for an API that is easy to use and reason about, but it doesn't offer great performance. For example, imagine you wanted to change 1,000 keys in Zookeeper -- by doing it one at a time, you'll spend most of your time waiting for the network to transmit requests and responses (this is what Labs 2/3 do!). If you could instead have *several* requests in flight at the same time, you could amortize some of this cost (a small sketch of this follows below). Zookeeper's wait-free API makes this possible, and allows for higher performance -- a key goal of the authors' use case.
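To make the "several requests in flight" point concrete, here is a small Go sketch using the go-zookeeper client that is linked at the end of this FAQ; the server address and key paths are made up. One caveat: this particular client only exposes synchronous calls, so the sketch overlaps them with goroutines. That captures the throughput benefit of pipelining, but unlike ZooKeeper's real asynchronous API it does not preserve the FIFO order of a single client's requests.

package main

import (
    "fmt"
    "sync"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    // Connect to a (hypothetical) local ZooKeeper server.
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            // Version -1 means "any version"; this assumes the znodes
            // already exist. Many Sets are in flight at once, so their
            // round-trip latencies overlap instead of adding up.
            path := fmt.Sprintf("/app/key-%d", i)
            if _, err := conn.Set(path, []byte("new-value"), -1); err != nil {
                fmt.Println("set", path, "failed:", err)
            }
        }(i)
    }
    wg.Wait()
}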
Q: What does wait-free mean?

A: The precise definition is as follows: a wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

The implementation of the Zookeeper API is wait-free because requests return to clients without waiting for other, slow clients or servers. Specifically, when a write is processed, the server handling the client's write returns as soon as it has received the state change (§4.2). Likewise, clients' watches fire as a znode is modified, and the server does *not* wait for the watching clients to acknowledge the notification before returning to the writing client. Some of the primitives that clients can build with the Zookeeper API are wait-free too (e.g., group membership), but others are not (e.g., locks, barriers).

Q: What is the reason for implementing 'fuzzy snapshots'? How can state changes be idempotent?

A: If the authors had decided to go for consistent snapshots, Zookeeper would have to stop all writes while making a snapshot of the in-memory database. You might remember that GFS went for this plan, but for a large database this could seriously hurt Zookeeper's performance. Instead, the authors go for a fuzzy snapshot scheme that doesn't require blocking writes while the snapshot is made. After a reboot, Zookeeper constructs a consistent snapshot by replaying the messages that were sent after the checkpoint started. Because all updates in Zookeeper are idempotent and delivered in the same order, the application state will be correct after reboot and replay---some messages may be applied twice (once to the state before recovery and once after recovery), but that is ok, because they are idempotent. The replay turns the fuzzy snapshot into a consistent snapshot of the application state.

Zookeeper turns the operations in the client API into what it calls transactions, in such a way that each transaction is idempotent. For example, if a client issues a conditional setData and the version number in the request matches, Zookeeper creates a setDataTXN that contains the new data, the new version number, and updated time stamps. This transaction (TXN) is idempotent: Zookeeper can execute it twice and end up in the same state (a simplified sketch of this appears after this group of questions).

Q: How does ZooKeeper choose leaders?

A: Zookeeper uses ZAB, an atomic broadcast system, which has leader election built in, much like Raft. If you are curious about the details, you can find a paper about ZAB here: http://dl.acm.org/citation.cfm?id=2056409

Q: How does Zookeeper's performance compare to other systems such as Paxos?

A: It has impressive performance (in particular, throughput); Zookeeper would beat the pants off your implementation of Raft. Three Zookeeper servers process 21,000 writes per second. Your Raft with 3 servers commits on the order of tens of operations per second (assuming a magnetic disk for storage), and maybe hundreds per second with SSDs.

Q: How does the ordering guarantee solve the race conditions in Section 2.3?

A: If a client issues many write operations to a znode followed by a write to Ready, then Zookeeper guarantees that all of those writes are applied to the znode before the write to Ready. Thus, if another client observes Ready, all of the preceding writes must have been applied, and so it is ok for that client to read the information in the znode. A sketch of this "ready" pattern follows below.
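Here is a small Go sketch of that Section 2.3 "ready" pattern, again using the go-zookeeper client; the znode names and server address are invented for illustration, and error handling is kept minimal.

package main

import (
    "fmt"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

// writer updates the configuration znodes and publishes /config/ready last.
// Because ZooKeeper applies a client's writes in FIFO order, no reader can
// see /config/ready without the preceding configuration writes having been
// applied as well.
func writer(conn *zk.Conn) error {
    // Take "ready" down while the configuration is being changed.
    if err := conn.Delete("/config/ready", -1); err != nil && err != zk.ErrNoNode {
        return err
    }
    // These writes can be pipelined; they assume the znodes already exist.
    if _, err := conn.Set("/config/addr", []byte("10.0.0.7:7000"), -1); err != nil {
        return err
    }
    if _, err := conn.Set("/config/replicas", []byte("3"), -1); err != nil {
        return err
    }
    _, err := conn.Create("/config/ready", nil, 0, zk.WorldACL(zk.PermAll))
    return err
}

// reader waits until /config/ready exists and only then reads the config.
// (The paper additionally keeps a watch on ready while reading, so a reader
// notices if a new configuration change starts in the meantime.)
func reader(conn *zk.Conn) error {
    for {
        exists, _, ch, err := conn.ExistsW("/config/ready")
        if err != nil {
            return err
        }
        if exists {
            addr, _, err := conn.Get("/config/addr")
            if err != nil {
                return err
            }
            fmt.Println("config addr:", string(addr))
            return nil
        }
        <-ch // watch fired (e.g., ready was created); re-check
    }
}

func main() {
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()
    if err := writer(conn); err != nil {
        panic(err)
    }
    if err := reader(conn); err != nil {
        panic(err)
    }
}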
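And to make the fuzzy-snapshot answer above concrete, here is a hypothetical, heavily simplified Go sketch of why a setDataTXN can safely be applied twice; none of these types are ZooKeeper's actual data structures.

package main

import "fmt"

// znode is an invented, stripped-down stand-in for a node in the data tree.
type znode struct {
    data    []byte
    version int32
    mtime   int64
}

// A conditional setData request is checked once (against the current version)
// and converted into a setDataTxn that records the *resulting* state: the new
// data, the new version, and the new modification time.
type setDataTxn struct {
    path    string
    data    []byte
    version int32
    mtime   int64
}

// apply is a blind overwrite: it does not read-modify-write (no version++),
// so applying the same TXN a second time during replay leaves the tree in
// exactly the same state.
func apply(tree map[string]*znode, t setDataTxn) {
    tree[t.path] = &znode{data: t.data, version: t.version, mtime: t.mtime}
}

func main() {
    tree := map[string]*znode{
        "/config/addr": {data: []byte("old"), version: 6, mtime: 1000},
    }
    txn := setDataTxn{path: "/config/addr", data: []byte("new"), version: 7, mtime: 2000}

    apply(tree, txn)
    apply(tree, txn) // replayed on top of a fuzzy snapshot: harmless
    n := tree["/config/addr"]
    fmt.Printf("%s v%d mtime=%d\n", n.data, n.version, n.mtime) // prints: new v7 mtime=2000
}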
Q: How big is the ZooKeeper database? It seems like the server must have a lot of memory.

A: It depends on the application, and, unfortunately, the paper doesn't report the database size for the different applications they have used Zookeeper with. Since Zookeeper is intended for configuration and master services (and not as a general-purpose data store), however, an in-memory database seems reasonable. For example, you could imagine using Zookeeper for GFS's master, and that amount of data should fit in the memory of a well-equipped server, as it did for GFS.

Q: What's a universal object?

A: It is a theoretical statement of how good Zookeeper's API is, based on a theory of concurrent objects that Herlihy introduced: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf. We won't spend any time on this statement and theory, but if you are curious there is a gentle introduction on this Wikipedia page: https://en.wikipedia.org/wiki/Non-blocking_algorithm.

The reason the authors appeal to this concurrent-object theory is that they claim Zookeeper provides a good coordination kernel that enables new primitives without changing the service. By pointing out that Zookeeper is a universal object in this theory, they support that claim.

Q: How does a client know when to leave a barrier (top of page 7)?

A: Leaving the barrier involves each client watching the znodes of all the other clients participating in the barrier. Each client waits until all of them are gone; once they are all gone, it leaves the barrier and continues computing.

Q: Is it possible to add more servers to an existing ZooKeeper cluster without taking the service down for a period of time?

A: It is -- although when the original paper was published, cluster membership was static. Nowadays, ZooKeeper supports "dynamic reconfiguration":
https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html
... and there is actually a paper describing the mechanism:
https://www.usenix.org/system/files/conference/atc12/atc12-final74.pdf
How do you think this compares to Raft's dynamic configuration change via overlapping consensus, which appeared two years later?

Q: How are watches implemented in the client library?

A: It depends on the implementation. In most cases, the client library probably registers a callback function that will be invoked when the watch triggers. For example, a Go client for ZooKeeper implements this by returning a channel from "GetW()" (get with watch); when the watch triggers, an "Event" structure is sent through that channel, which the application can check in a select clause. See https://godoc.org/github.com/samuel/go-zookeeper/zk#Conn.GetW. A short usage sketch follows below.
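For concreteness, here is a minimal sketch of that pattern with the Go client; the znode path and server address are made up. GetW returns the current data together with a channel, and the application waits on that channel in a select.

package main

import (
    "fmt"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    for {
        // GetW returns the current data plus a channel on which a single
        // Event will be delivered when the watch fires.
        data, _, ch, err := conn.GetW("/config/addr")
        if err != nil {
            panic(err)
        }
        fmt.Println("current value:", string(data))

        // Watches are one-shot, so after handling an event the loop
        // re-reads and thereby re-registers the watch.
        select {
        case ev := <-ch:
            fmt.Println("watch fired:", ev.Type, "on", ev.Path)
        case <-time.After(time.Minute):
            fmt.Println("no change for a minute; refreshing the watch anyway")
        }
    }
}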