Zanzibar FAQ

Q: What does the paper mean by causally consistent?

A: Usually causal consistency refers to a model that defaults to
eventual consistency, but has the added constraint that if client C1
reads object O1 and then writes O2, and then client C2 reads O2 and
sees C1's updates, and then C2 reads O1, C2 is guaranteed to see a
version of O1 that's at least as up to date as what C1 read. Causal
consistency is more consistent than eventual consistency, but less
consistent than linearizability (or than timeline consistency, which
is the same as linearizability). Causally consistent systems usually
include a tracking mechanism that tells C2 what C1 had seen at the
time C1 wrote O2.

Zanzibar's consistency seems to be linearizability for writes (from
Spanner's timeline consistency) and snapshot consistency for reads.
Writes to Zanzibar ACLs are linearizable, and as a result writes can
be viewed as a totally ordered sequence. Each client operation (like
read and check) observes data at a single point in that sequence: a
snapshot. That point may be significantly in the past, and thus reads
generally yield stale data. If the client provided a zookie, the point
is guaranteed to be no earlier than the time in the zookie.

The only causal aspect to the design seems to be the zookies, which
cause ACL checks on an object to take place in a snapshot at a time no
earlier than when the object was last updated. The zookie contains
more or less the time at which the object was updated (Section 2.4.4),
not any information about causality specifically.

Many of the paper's mentions of causal consistency are difficult to
interpret, for example when the Abstract says that respecting causal
ordering causes Zanzibar to provide external consistency. First,
Zanzibar doesn't generally respect causal ordering; C2 may observe
some data created by C1 and nevertheless perform a subsequent
operation in a snapshot earlier than the one in which C1 operated.
Second, Zanzibar is not generally externally consistent; if C1
modifies an ACL, C2 may nevertheless observe an earlier version of the
ACL, depening on what snapshot time C2 uses. Third, in the usual
definitions, causal consistency is weaker than external consistency;
external consistency implies causal consistency, but not the other way
around.

Q: If Bob is in a document's ACL, and then someone removes Bob from
the ACL, and then Bob tries to access the document, Zanzibar may allow
Bob's access. Since the ACL check will be done in a snapshot at the
time the document was last modified (due to the zookie in the
document), which might be before Bob was removed from the document.
Why is this OK from the point of view of security?

A: The paper does not explain. Perhaps the reason is that Bob might
have read and stored a copy of the document while he was in the ACL.
That being the case, there's not much security to be gained by
forbidding Bob's access. And there's a big performance cost, since it
would require doing all ACL checks with the freshest data, which may
not have arrived yet at the local Spanner replica.

Q: How did they configure spanner? How did they set it up to be a
worldwide service with complete replicas in dozens of clusters?

A: Section 3.1.4 suggests that each Spanner tablet has five voting
replicas in the United States, plus a few dozen read-only replicas
around the world. This means that writes (and reads that need the
latest data) have to be sent to a server in the United States, which
could take a while from some parts of the world. The paper says all
the voting replicas are in the United States in order to keep Paxos
agreement time from being too long.

Q: What is the reason for zookies?

A: When a client modifies a document, the client asks Zanzibar for the
current timestamp (inside a zookie), and the client puts that zookie
inside the modified document. Later, when some client wants to read
the document and needs an ACL check, the client reads the zookie from
the document and gives it to Zanzibar. Zanzibar looks inside the
cookie, extracts the timestamp, and does the ACL check at a snapshot
no earlier than the time in the zookie. Part of the point is for
Zanzibar to use ACL information no earlier than the time at which the
document update occured, to be consistent with the document updater's
notion of who could read their update. In addition, the fact that the
zookie's timestamp is usually well in the past means that later ACL
checks can be done in an old snapshot, and don't require Zanzibar to
wait for the latest data to be replicated in the local Spanner
servers.

Q: What does the paper mean by "opaque" zookies?

A: Zanzibar puts a Spanner timestamp inside the zookie, and gives the
zookie to the client. "Opaque" means that the client is not supposed
to look inside the zookie to extract the timestamp. Nor is a client
allowed to create its own zookies. This opaqueness gives the designers
the chance to change the zookie format in the future without breaking
existing clients.

Q: What is Leopard?

A: Many ACL entries are actually references to groups, meaning things
like "anyone in group mit can read object X" or "anyone in group
mit-eecs is also a member of group mit". So when Zanzibar executes a
client's "can rtm read object X?" RPC, Zanzibar might have to search
down a tree of these groups to see if the user in question is a member
of a group that's a member of a group that's allowed to read the
object in question.

Leopard's job is to pre-compute the answers to these group searches.
Leopard produces an index that enumerates all the users which the
group hierarchy implies are transitively members of each group. So
Leopard's index would say that rtm is a transitive member of group
mit. The Leopard index allows Zanzibar to avoid doing searches in the
group tree.

The paper doesn't say under what circumstances Zanzibar uses Leopard vs
searches the group tree.

Q: What do normalized and denormalized mean?

A: The paper doesn't say what it means by these terms. But, following
the use of these terms in relational databases, normalized data
probably means ACL entries in the form they were originally specified,
as in Table 1. Denormalized data probably means information that's the
result of pre-computation of results from multiple items of normalized
data, intended to speed up lookups. For example, executing a check
operation on normalized data might require descending a tree of group
relationships; the result (if stored or cached) would be denormalized
data, since it is the result of computation on multiple items of
normalized data.

The paper says Zanzibar by default uses only normalized data, stored
in Spanner and in the aclserver caches. The usual reason for this is
that it's complex to update or invalidate denormalized data when the
underlying normalized data changes. However, avoiding pre-computation
means that Zanzibar must often do a fair amount of work to execute
client queries, for example descending trees of groups.

The paper's Leopard indexing system generates and stores pre-computed
results, which are denormalized data. Leopard keeps its data
reasonably up-to-date by watching the changelog (Figure 2). The paper
implies that Zanzibar uses Leopard only in exceptional circumstances.

Somewhat confusingly, the RPC result cache described in 3.2.5 seems to
be denormalized (the result of searches down the tree of groups), and
keeping it consistent seems relatively straightforward, since each
cached item reflects data at a single time-stamp.

Q: Section 4.4 mentions a 10% cache hit rate -- why so low?

A: One mystery here is that Section 4.4 says that Zanzibar's in-memory
caching handles 200 million lookups/second. And that Zanzibar sends only
20 million reads to Spanner per second. To me that looks like the
in-memory cache has about a 90% hit rate.

4.4 says the 10% hit rate applies to the cache of RPC results. So maybe
part of the answer is that the aclservers have two different caches: one
for RPC results, and a separate cache of records fetched from Spanner.
The paper doesn't quite say this, but it's probably true.

Maybe the paper quotes only the 10% RPC result hit rate because RPCs are
much more expensive than in-memory lookups. Or perhaps because the
authors were mostly worried about hot spots, and the RPC result cache is
most relevant for that situation.

Still, 10% is low regardless of what cache it refers to. To the extent
that the RPC result cache holds entries like "rtm is allowed to view
YouTube video XYZ", then 10% makes sense because there are a vast
number of users, and a vast number of videos, and few users are likely
to watch the same video multiple times, so you wouldn't expect much
re-use (hitting) of cache entries. Probably the real value of the RPC
result cache is for entries like "rtm is a member of group mit"; such
an entry might be used multiple times, and might be complex to compute
if it required searching sub-groups.