ZooKeeper FAQ Q: Why are only update requests A-linearizable? A: Because the authors want reads to scale with the number of servers, so they want to them to execute a server without requiring any interaction with other servers. This comes at the cost of consistency: they are allowed to return stale data. Q: How does linearizability differ from serializability? A: Serializability is a correctness condition that is typically used for systems that provide transactions; that is, systems that support grouping multiple operations into an atomic operations. Linearizability is typically used for systems without transactions. When the Zookeeper paper refers to "serializable" in their definition of linearizability, they just mean a serial order. We talk about serializability in subsequent papers. Here is a blog post that explains the difference, if you are curious: http://www.bailis.org/blog/linearizability-versus-serializability/ Although the blog post gives precise definitions, designers are not that precise when using those terms when they describe their system, so often you have to glean from the context what correctness condition the designers are shooting for. Q: What is pipelining? Zookeeper "pipelines" the operations in the client API (create, delete, exists, etc). What pipelining means here is that these operations are executed asynchronously by clients. The client calls create, delete, sends the operation to Zookeeper and then returns. At some point later, Zookeeper invokes a callback on the client that tells the client that the operation has been completed and what the results are. This asynchronous interface allow a client to pipeline many operations: after it issues the first, it can immediately issue a second one, and so on. This allows the client to achieve high throughput; it doesn't have to wait for each operation to complete before starting a second one. A worry with pipelining is that operations that are in flight might be re-ordered, which would cause the problem that the authors to talk about in 2.3. If a the leader has many write operations in flight followed by write to ready, you don't want those operations to be re-ordered, because then other clients may observe ready before the preceding writes have been applied. To ensure that this cannot happen, Zookeeper guarantees FIFO for client operations; that is the client operations are applied in the order they have been issued. Q: What does wait-free mean? A: The precise definition is as follows: A wait-free implementation of a concurrent data object is one that guarantees that any process can complete any operation in a finite number of steps, regardless of the execution speeds of the other processes. This definition was introduced in the following paper by Herlihy: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf The implementation of Zookeeper API is wait-free. Some of the primitives that client can implement with Zookeeper APIs are wait-free too (e.g., group membership), but others are not (e.g., locks, barrier). Q: What is the reason for implementing 'fuzzy snapshots'? How can state changes be idempotent? A: If the authors had to decided to go for consistent snapshots, Zookeeper would have to stop all writes will making a snapshot for the in-memory database. You might remember that GFS went for this plan, but for large database, this could hurt the performance of Zookeeper seriously. Instead the authors go for a fuzzy snapshot scheme that doesn't require blocking all writes while the snapshot is made. After reboot, they construct a consistent snapshot by replaying the messages that were sent after the checkpoint started. Because all updates in Zookeeper are idempotent and delivered in the same order, the application-state will be correct after reboot and replay---some messages may be applied twice (once to the state before recovery and once after recovery) but that is ok, because they are idempotent. The replay fix the fuzzy snapshot to be a consistent snapshot of the application state. Zookeeper turns the operations in the client API into something that it calls a transaction in a way that the transaction is idempotent. For example, If a client issues a conditional setData and the version number in the request matches, Zookeeper creates a setDataTXN that contains the new data, the new version number, and updated time stamps. This transaction (TXN) is idempotent: Zookeeper can execute it twice and it will result in the same state. Q: How does ZooKeeper choose leaders? A: Zookeeper uses ZAB, an atomic broadcast system, which has leader election build in, much like Raft. If you are curious about the details, you can find a paper about ZAB here: http://dl.acm.org/citation.cfm?id=2056409 Q: How does Zookeeper's performance compare to other systems such as Paxos? A: It has impressive performance (in particular throughput); Zookeeper would beat the pants of your implementation of Raft. 3 zookeeper server process 21,000 writes per second. Your raft with 3 servers commits on the order of tens of operations per second (assuming a magnetic disk for storage) and maybe hundreds per second with SSDs. Q: How does the ordering guarantee solve the race conditions in Section 2.3? If a client issues many write operations to a z-node, and then the write to Ready, then Zookeeper will guarantee that all the writes will be applied to the z-node before the write to Ready. Thus, if another client observes Ready, then the all preceding writes must have been applied and thus it is ok for other clients to read the info in the z-node. Q: How big is the ZooKeeper database? It seems like the server must have a lot of memory. It depends on the application, and, unfortunately, the paper doesn't report on it for the different application they have used Zookeeper with. Since Zookeeper is intended for configuration services/master services (and not for a general-purpose data store), however, an in-memory database seems reasonable. For example, you could imagine using Zookeeper for GFS's master and that amount of data should fit in the memory of a well-equipped server, as it did for GFS. Q: What's a universal object? A: It is a theoretical statement of how good the API of Zookeeper is based on a theory of concurrent objects that Herlihy introduced: https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf. We won't spend any time on this statement and theory, but if you care there is a gentle introduction on this wikipedia page: https://en.wikipedia.org/wiki/Non-blocking_algorithm. The reason that authors appeal to this concurrent-object theory is that they claim that Zookeeper provides a good coordination kernel that enables new primitives without changing the service. By pointing out that Zookeeper is an universal object in this concurrent-object theory, they support this claim. Q: How does a client know when to leave a barrier (top of page 7)? A: Leaving the barrier involves each client watching the znodes for all other clients participating in the barrier. Each client waits for all of them to be gone. If they are all gone, they leave the barrier and continue computing.