ZooKeeper FAQ
Q: Why are only update requests A-linearizable?

A: Because the authors want reads to scale with the number of servers,
so they want to them to execute a server without requiring any
interaction with other servers. This comes at the cost of
consistency: they are allowed to return stale data.

Q: How does linearizability differ from serializability?

A: Serializability is a correctness condition that is typically used for
systems that provide transactions; that is, systems that support
grouping multiple operations into an atomic operations. 

Linearizability is typically used for systems without transactions.
When the Zookeeper paper refers to "serializable" in their definition
of linearizability, they just mean a serial order.

We talk about serializability in subsequent papers. Here is a blog
post that explains the difference, if you are curious:
http://www.bailis.org/blog/linearizability-versus-serializability/

Although the blog post gives precise definitions, designers are not
that precise when using those terms when they describe their
system, so often you have to glean from the context what
correctness condition the designers are shooting for.

Q: What is pipelining?

Zookeeper "pipelines" the operations in the client API (create,
delete, exists, etc). What pipelining means here is that these
operations are executed asynchronously by clients. The client calls
create, delete, sends the operation to Zookeeper and then returns.
At some point later, Zookeeper invokes a callback on the client that
tells the client that the operation has been completed and what the
results are. This asynchronous interface allow a client to pipeline
many operations: after it issues the first, it can immediately issue a
second one, and so on. This allows the client to achieve high
throughput; it doesn't have to wait for each operation to complete
before starting a second one.

A worry with pipelining is that operations that are in flight might be
re-ordered, which would cause the problem that the authors to talk
about in 2.3. If a the leader has many write operations in flight
followed by write to ready, you don't want those operations to be
re-ordered, because then other clients may observe ready before the
preceding writes have been applied. To ensure that this cannot
happen, Zookeeper guarantees FIFO for client operations; that is the
client operations are applied in the order they have been issued.

Q: What does wait-free mean?

A: The precise definition is as follows: A wait-free implementation of
a concurrent data object is one that guarantees that any process can
complete any operation in a finite number of steps, regardless of the
execution speeds of the other processes. This definition was
introduced in the following paper by Herlihy:
https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf

The implementation of Zookeeper API is wait-free. Some of the
primitives that client can implement with Zookeeper APIs are wait-free
too (e.g., group membership), but others are not (e.g., locks,
barrier).

Q: What is the reason for implementing 'fuzzy snapshots'? How can
state changes be idempotent?

A: If the authors had to decided to go for consistent snapshots,
Zookeeper would have to stop all writes will making a snapshot for
the in-memory database. You might remember that GFS went for
this plan, but for large database, this could hurt the performance of
Zookeeper seriously. Instead the authors go for a fuzzy snapshot
scheme that doesn't require blocking all writes while the snapshot is
made. After reboot, they construct a consistent snapshot by
replaying the messages that were sent after the checkpoint started.
Because all updates in Zookeeper are idempotent and delivered in
the same order, the application-state will be correct after reboot and
replay---some messages may be applied twice (once to the state
before recovery and once after recovery) but that is ok, because they
are idempotent. The replay fix the fuzzy snapshot to be a consistent
snapshot of the application state.

Zookeeper turns the operations in the client API into something that
it calls a transaction in a way that the transaction is idempotent. For
example, If a client issues a conditional setData and the version
number in the request matches, Zookeeper creates a setDataTXN
that contains the new data, the new version number, and updated
time stamps. This transaction (TXN) is idempotent: Zookeeper can
execute it twice and it will result in the same state.

Q: How does ZooKeeper choose leaders?

A: Zookeeper uses ZAB, an atomic broadcast system, which has leader
election build in, much like Raft. If you are curious about the
details, you can find a paper about ZAB here:
http://dl.acm.org/citation.cfm?id=2056409

Q: How does Zookeeper's performance compare to other systems
such as Paxos?

A: It has impressive performance (in particular throughput); Zookeeper
would beat the pants of your implementation of Raft. 3 zookeeper
server process 21,000 writes per second. Your raft with 3 servers
commits on the order of tens of operations per second (assuming a
magnetic disk for storage) and maybe hundreds per second with
SSDs. 

Q: How does the ordering guarantee solve the race conditions in
Section 2.3?

If a client issues many write operations to a z-node, and then the
write to Ready, then Zookeeper will guarantee that all the writes
will be applied to the z-node before the write to Ready. Thus, if
another client observes Ready, then the all preceding writes must
have been applied and thus it is ok for other clients to read the info
in the z-node.

Q: How big is the ZooKeeper database? It seems like the server must
have a lot of memory.

It depends on the application, and, unfortunately, the paper doesn't
report on it for the different application they have used Zookeeper
with. Since Zookeeper is intended for configuration services/master
services (and not for a general-purpose data store), however, an
in-memory database seems reasonable. For example, you could
imagine using Zookeeper for GFS's master and that amount of data
should fit in the memory of a well-equipped server, as it did for GFS.

Q: What's a universal object?

A: It is a theoretical statement of how good the API of Zookeeper is
based on a theory of concurrent objects that Herlihy introduced:
https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf. We won't
spend any time on this statement and theory, but if you care there is
a gentle introduction on this wikipedia page:
https://en.wikipedia.org/wiki/Non-blocking_algorithm. 

The reason that authors appeal to this concurrent-object theory is
that they claim that Zookeeper provides a good coordination kernel
that enables new primitives without changing the service. By
pointing out that Zookeeper is an universal object in this
concurrent-object theory, they support this claim.

Q: How does a client know when to leave a barrier (top of page 7)?

A: Leaving the barrier involves each client watching the znodes for
all other clients participating in the barrier. Each client waits for
all of them to be gone. If they are all gone, they leave the barrier
and continue computing.