6.5840 2023 Lecture 19: Zanzibar

Zanzibar: Google's Consistent, Global Authorization System (2019), by Pang et al.

why are we reading this paper?
  a distributed application -- not a key/value store!
    makes sophisticated use of distributed storage
  another interplay of performance, caching, consistency
  security, which we've mostly ignored
  not unlike the Facebook/memcache situation
    global, huge load, caching, consistency
    100 TB of data, 10,000 servers, 30 locations globally
    10 million client queries/second
  Zanzibar exploits two advantages vs Facebook/memcache
    specific use case with particular consistency needs
    powerful global storage system (Spanner)

how does Zanzibar fit into Google?
  [users, login, YouTube, Cloud, Drive, Zanzibar]
  services hold objects, e.g. video files
  need to decide whether each user access is permitted
  need to allow users to specify who can read/write their items
  Zanzibar pulls the authorization logic out into a common service
    eliminates redundant implementations
    uniform model for users
    eases cross-service sharing, embedding

authentication vs authorization
  authentication ensures a user's identity -- e.g. username/password
  authorization decides what an (authenticated) user is permitted to do
  Zanzibar handles just authorization
  clients of Zanzibar are Google services
    how does Zanzibar know clients will obey permissions?
    all the code is written by Google, thus trustworthy

*** What's the ACL model?

the core Zanzibar API call:
  check(object, user, action) -> yes/no

implemented with Access Control Lists (ACLs)
  ACL entries enumerate which users/groups can read/write/&c each object
  an entry looks like object#relation@user
    object: an identifier, perhaps a file name or unique ID
    relation: owner, read, write, &c
    user: user identifier, perhaps a number, global across Google
    (I'm omitting namespaces)
  example:
    o1#read@10
    o1#read@20
    o2#read@10

groups
  objects often have similar/identical sets of authorized users
  so Zanzibar supports entries that describe groups
    mit#member@10
    mit#member@20
    o3#read@mit#member
    o4#read@mit#member
  groups are often nested
    mit#member@mit-students#member
    mit#member@mit-staff#member
  groups are indispensable
    for ACL storage space
    for management, e.g. when a new person joins an organization

how does Zanzibar perform an authorization check?
  at a high level for now
  client sends a "check" request specifying object, relation, user
    e.g. o3#read, 20
  Zanzibar can look in its DB for all ACL entries starting with "o3#read"
    due to groups, it probably won't find o3#read@20
    it will find a bunch of group names, e.g. mit#member
    and those groups may contain other groups
    and maybe one of those groups has user 20 -> "access granted"
    or maybe not -> "access denied"
  so an ACL check is a search down a tree of groups
    (a code sketch of this search appears after the architecture section below)
           object
             mit
      mit-students   mit-staff
       10, 30, 40     20, 45
  an ACL check can be a lot of work if there are lots of big, nested groups

*** Zanzibar service architecture

deployed at 30 sites all over the world (Section 4)
  to match deployment of Google services
  so services have local access to Zanzibar

at each site:
  clients
  Zanzibar aclservers -- handle client RPCs
  Spanner servers -- store ACL entries

how many Spanner servers for storage?
  100 terabytes of ACL data (Section 4)
  times 30 sites * 2 replicas per site = 6000 terabytes
  perhaps 4 TB of disk storage per Spanner server
  so about 1500 Spanner servers, spread over 30 sites

for each Spanner shard:
  5 voting replicas in the USA (3.1.4)
  two non-voting replicas in each of 30 sites

note that client objects themselves are stored in some other system,
  e.g. GFS, Colossus, BigTable, or some other Spanner deployment --
  not in the Spanner deployment that stores the ACLs
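
to make the group search behind check() concrete, here's a minimal Go sketch
  (my own illustration, not Zanzibar's code)
  tuples live in a map; a member that is itself a userset triggers recursion
  it ignores namespaces, caching, sharding across aclservers, snapshot
    timestamps, and cycle detection

  package main

  import "fmt"

  // tuples maps "object#relation" to its direct members. A member is either
  // a user ID ("10") or another userset ("mit#member").
  var tuples = map[string][]string{
          "o3#read":             {"mit#member"},
          "mit#member":          {"mit-students#member", "mit-staff#member"},
          "mit-students#member": {"10", "30", "40"},
          "mit-staff#member":    {"20", "45"},
  }

  // check reports whether user appears under objectRelation, searching
  // nested groups depth-first. Real Zanzibar shards this search by object
  // across aclservers, caches results, reads the tuples from Spanner at a
  // single snapshot timestamp, and guards against group cycles.
  func check(objectRelation, user string) bool {
          for _, m := range tuples[objectRelation] {
                  if m == user {
                          return true // direct member
                  }
                  if _, isGroup := tuples[m]; isGroup && check(m, user) {
                          return true // member via a nested group
                  }
          }
          return false
  }

  func main() {
          fmt.Println(check("o3#read", "20")) // true: mit#member -> mit-staff#member -> 20
          fmt.Println(check("o3#read", "99")) // false
  }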
*** How does Zanzibar obtain good performance?

what kind of performance do they need?
  10 million client qps (4.1)
  200 million internal lookups/second due to group searches (4.4)

could clients just read the ACL info from Spanner?
  how many Spanner servers would they need for 200 million reads/second?
  Spanner paper Table 3 says 13k reads/second/server
  200 million/second would require 15,000 Spanner servers
    assumes perfect load balance
    and they say hot-spots are a serious problem

ok, so there's a caching/computing layer of aclservers between clients and Spanner

how is work divided among the aclservers?
  by hash of object
  so searches down a group tree cause communication
    the RPC is (probably) "is rtm in group MIT?"
  why divide by object?
    as opposed to having one aclserver do all the work of each client request
    makes more effective use of aclserver cache RAM
    allows parallelism for complex tree searches
    most critically, helps with hot-spots
      by bringing together simultaneous accesses for a given object

what's a hot spot? what causes them?
  Section 3.2.5 explains one example, presumably an important one
  one user, many search results, so checking many objects' ACLs
    for that user, simultaneously
  object ACLs often contain groups
    same groups for related objects
  so many simultaneous tests for the *same* user X in group Y
    all handled by the same Spanner server
    it's that Spanner server that's the hot spot
  hot-spots are not about total load, but imbalance --
    a sudden burst of work for a single Spanner server

how does Zanzibar cope with hot-spots?
  * the aclserver sharding scheme brings requests about the same object
    together, which helps caching
  * the sharding scheme allows aggregation of Spanner lookup requests,
    which reduces overhead
  * RPC result caching, e.g. "rtm is in group MIT"
    i.e. not just caching Spanner read results, but also computation results
  * a "lock table" to prevent redundant cache-miss fetches from Spanner
    e.g. if many search results need "is rtm in group 6.5840?"
    again, it's important to bring all such requests to the same aclserver,
      so they all see the same lock table
    (see the code sketch at the end of this section)
  * prefetch all object#relation@* for objects that Zanzibar sees
    are used by many users

Section 4.4 says the cache hit rate is only 10%
  why so low?
    the paper doesn't say
    lots of users, lots of objects
    users may browse randomly (YouTube)
    Google client apps may suppress redundant requests
    thus Zanzibar may not see much locality (i.e. repeated requests)
  10% is just hits in the RPC-level cache
    probably doesn't include hits in the aclservers' cache of Spanner data
  why is a 10% hit rate nevertheless helpful?
    the paper says it helps specifically with hot-spots:
      simultaneous requests for the exact same item
    the win isn't a decrease in total Spanner load, but a decreased chance
      of one Spanner server seeing a sudden high load

what was the performance impact of all this?
  the paper doesn't have a detailed explanation of performance
  one view
    4.4 says 10 million client requests result in 20 million Spanner reads
    dividing by the 13,000 reads/second from the Spanner paper,
      about 1500 Spanner servers are required to handle this load
    much better than 15,000 servers!
  another view
    Zanzibar has 10,000 servers (Section 4)
    thus 1,000 client requests/second/server
    pretty good but not amazing
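
the "lock table" deserves a sketch
  below is a minimal, hypothetical Go version of the idea (my illustration,
    not Zanzibar's code): when many concurrent requests miss the cache for
    the same key, only the first one fetches from Spanner; the rest wait for
    that result
  the key string "mit#member@rtm" and the fetch function are made up
  Go's golang.org/x/sync/singleflight package implements the same pattern

  package main

  import (
          "fmt"
          "sync"
          "sync/atomic"
          "time"
  )

  // result holds the outcome of one fetch; done is closed when it's ready.
  type result struct {
          val  bool
          err  error
          done chan struct{}
  }

  // LockTable collapses concurrent misses for the same key into one fetch.
  type LockTable struct {
          mu       sync.Mutex
          inflight map[string]*result
  }

  func NewLockTable() *LockTable {
          return &LockTable{inflight: map[string]*result{}}
  }

  // Get returns the value for key, calling fetch at most once no matter how
  // many goroutines ask for the same key concurrently; the others wait and
  // share the first caller's result.
  func (lt *LockTable) Get(key string, fetch func(string) (bool, error)) (bool, error) {
          lt.mu.Lock()
          if r, ok := lt.inflight[key]; ok {
                  lt.mu.Unlock()
                  <-r.done // someone else is already fetching; wait for their answer
                  return r.val, r.err
          }
          r := &result{done: make(chan struct{})}
          lt.inflight[key] = r
          lt.mu.Unlock()

          r.val, r.err = fetch(key) // e.g. evaluate "is rtm in group MIT?" via Spanner
          close(r.done)

          lt.mu.Lock()
          delete(lt.inflight, key) // later misses fetch afresh (or hit the real cache)
          lt.mu.Unlock()
          return r.val, r.err
  }

  func main() {
          lt := NewLockTable()
          var wg sync.WaitGroup
          var fetches int32
          for i := 0; i < 10; i++ { // 10 concurrent identical requests
                  wg.Add(1)
                  go func() {
                          defer wg.Done()
                          lt.Get("mit#member@rtm", func(string) (bool, error) {
                                  atomic.AddInt32(&fetches, 1)
                                  time.Sleep(10 * time.Millisecond) // pretend to ask Spanner
                                  return true, nil
                          })
                  }()
          }
          wg.Wait()
          fmt.Println("fetches:", atomic.LoadInt32(&fetches)) // almost certainly 1, not 10
  }

note this only helps if all the identical requests land on the same
  aclserver, which is exactly what sharding by object arranges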
*** How does Zanzibar ensure cache consistency?

where there's caching, there's stale data!
  and consistency problems are potentially serious,
    since Zanzibar is making security decisions

why not always use the very latest ACL data? i.e. be linearizable?
  Spanner replicas may not have seen the latest completed writes
    Paxos, non-voting replicas
  and aclserver caches would also have to be kept up to date
    painful, as seen in the Facebook/memcache paper
  so the Zanzibar design is not linearizable

the paper says the "new enemy" problem is the most important
  consistency scenario to get right

Example A (simplified from Section 2.2):
  1. Alice removes Bob from group G.
  2. then Alice adds group G to the ACL of document D1.
  Bob should not be able to read D1.
  i.e. if a client sees update #2, it should also see update #1.

Example B:
  1. Alice removes Bob from document D2's ACL.
  2. then Alice updates D2.
  Bob should not be able to read the new content.
  i.e. if a client sees #2, it should also see #1.

What inconsistencies *are* allowed?

Example C:
  document D has group G in its ACL
  Bob is a member of G
  Alice removes Bob from G
  Bob may be able to continue reading D (for a while)!

Example D:
  document D has group G in its ACL
  Alice adds Bob to G
  Bob may not be able to read D (for a while)!

Why are these inconsistencies OK?
  the paper does not discuss this
  for Example C, Bob could have read D and saved a copy,
    so letting him continue to see D for a bit is not so bad
  for Example D, there's no security problem, just inconvenience

how does Zanzibar exploit weak-ish consistency to get performance,
  yet not cause "new enemy" problems?

relevant Spanner behavior
  Spanner writes are linearizable (== externally consistent)
    you can view them as happening one at a time, in order,
    and that order conforms to wall-clock time order
  Zanzibar can ask Spanner to read at a specific time in the past
    the read reflects all writes that happened before that time
    true even for reads of items on different Spanner shards,
      due to TrueTime synchronization
  if the read time is too recent, Spanner forces the read to wait,
    since the local replica may not be in the Paxos majority for recent writes
    Zanzibar tries to avoid that

for a given client request, Zanzibar uses the same timestamp for all Spanner reads
  in order to obtain a view that's consistent as of some point in time
  i.e. not a mix of old and new ACL info

Zanzibar intentionally reads at a timestamp somewhat in the past
  so that Spanner can read without delay
  3.2.1 describes an adaptive scheme to choose the default timestamp lag

Example A works correctly regardless of what timestamp Zanzibar uses for reads
  before write #1: G isn't yet in D1's ACL, so Bob can't read D1
  before write #2: the same
  after write #2: Bob isn't in group G
  relies on Spanner being linearizable,
    and on Zanzibar using the same timestamp to read G and D1's ACL

what about requests served from Zanzibar's cached info?
  Zanzibar tags each cached item with its Spanner write timestamp
  so Zanzibar can access its cache as of a consistent point in time

but what about Example B?
  what keeps Zanzibar from using an old timestamp to read D2's ACL?

zookies fix Example B
  when Alice updates D2, she asks Zanzibar for a "zookie",
    which contains the current time
    this is the "content-change" check in 2.4.4
  Alice stores the zookie inside D2
  when Bob asks his client to read D2, the client extracts the zookie from D2,
    and asks Zanzibar to perform the ACL check at the zookie's timestamp
    (really, no earlier than that timestamp)
  the result: Zanzibar uses a timestamp new enough that it sees that Bob
    was removed from the ACL, but not so new that it causes read delay
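
to show how a client service might thread zookies through its own storage,
  here's a hypothetical Go sketch
  the AuthzClient interface, Document type, and function names are made up
    for illustration; they are not the real Zanzibar client API, they just
    mirror the content-change check and the check-at-least-this-timestamp
    behavior described above

  package main

  import (
          "errors"
          "fmt"
  )

  // Zookie is opaque to clients; internally it encodes a (Spanner) timestamp.
  type Zookie []byte

  // AuthzClient is a hypothetical stand-in for the API of Section 2.4.
  type AuthzClient interface {
          // ContentChangeCheck is evaluated at the current time; if the write
          // is permitted, it returns a fresh zookie to store with the content.
          ContentChangeCheck(object, relation, user string) (Zookie, bool, error)
          // Check is evaluated at a snapshot no earlier than the zookie's time.
          Check(object, relation, user string, atLeast Zookie) (bool, error)
  }

  // Document is the client service's object, e.g. D2, with its stored zookie.
  type Document struct {
          ID      string
          Content []byte
          Zookie  Zookie // from the content-change check that wrote Content
  }

  // saveDocument: Alice writes new content. The zookie stored with the
  // document records "this content is at least this new".
  func saveDocument(az AuthzClient, d *Document, newContent []byte, user string) error {
          z, ok, err := az.ContentChangeCheck(d.ID, "write", user)
          if err != nil {
                  return err
          }
          if !ok {
                  return errors.New("write denied")
          }
          d.Content, d.Zookie = newContent, z
          return nil
  }

  // readDocument: Bob reads. Checking at a snapshot >= the document's zookie
  // guarantees Zanzibar sees any ACL change that preceded the content change
  // (Example B), while usually still being old enough that Spanner replicas
  // can answer without waiting.
  func readDocument(az AuthzClient, d *Document, user string) ([]byte, error) {
          ok, err := az.Check(d.ID, "read", user, d.Zookie)
          if err != nil {
                  return nil, err
          }
          if !ok {
                  return nil, errors.New("read denied")
          }
          return d.Content, nil
  }

  // allowAll is a trivial fake, just enough to run the example.
  type allowAll struct{}

  func (allowAll) ContentChangeCheck(o, r, u string) (Zookie, bool, error) {
          return Zookie("ts=now"), true, nil
  }
  func (allowAll) Check(o, r, u string, z Zookie) (bool, error) { return true, nil }

  func main() {
          d := &Document{ID: "D2"}
          if err := saveDocument(allowAll{}, d, []byte("v2"), "alice"); err != nil {
                  panic(err)
          }
          content, err := readDocument(allowAll{}, d, "bob")
          fmt.Println(string(content), err)
  }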
consistency summary
  exploits Spanner's linearizability and snapshot-in-time reads
  exploits specific requirements of security to allow use of past snapshots
  tags cached data with timestamps rather than trying to keep it up to date
  zookies help avoid using snapshots that are too old

what about fault tolerance?
  not much action here
  the state is all in Spanner, which is fault-tolerant
  it's nice to have infrastructure