6.824 2016 Lecture 17: Existential Consistency

why are we reading this paper?
  practical storage for giant read-heavy web sites (again)
  technique to detect anomalies
  practical consistency concerns

review of overall FB architecture
  each region has a full DB replica
    sharded across MySQL servers
    for each data item, one region (thus one DB server) is the master
    master DB updates slave DBs in other regions
  each region has lots of Front Ends (FEs)
    web servers, with web site logic in e.g. Python or PHP
  each region has lots of caches
    a region's caches are divided into independent clusters
    each cluster shards the data over its caches
    FE reads first try cache; use DB if cache miss
    each FE usually talks to just one cluster (in its region)

why this overall arrangement?
  why multiple geographic sites (regions)?
    low network delay to users
  why a complete replica in DBs for each region?
    so all reads can be local
  why DB servers specifically?
    persistence, transactional writes
  why sharded DB servers?
    far too big and too much write load for one DB server
  why caches (why not just lots of DB servers, small shards)?
    caches have much higher throughput than DBs for reads
    i.e. higher throughput for a given # of machines
  why multiple cache clusters per region?
    each cluster has a copy of each item, can serve in parallel
    helps with popular items
  why shard across cache servers in each cluster?
    limited # of copies of each item, to reduce RAM use
    limited # of places to send invalidates after a write

early FB "look-aside" caching
  memcached cache servers (mc)
  driven by FEs; mc and DB not aware of each other
  many web sites are structured like this!
  read(k):
    v = mc-get(k)
    if v == nil:
      v = fetch from local DB
      mc-set(k, v)
    return v
  write(k, v):
    send k,v to master DB
    mc-delete(k)
    asynchronously (write doesn't wait):
      master DB sends to slave DBs
      all DBs send invalidates to caches

it was hard for FB to get look-aside caching right:
  * miss/write race
    entire write (incl. delete) might occur between a read()'s
      DB fetch and mc-set()
    leaving memcache with *permanent* stale data
    users will eventually notice missing update
  * read-after-write anomaly
    C1 sends write to DB master in remote region
    same user reads same key; miss in memcache; read from local DB
    local DB hasn't seen update
    users may immediately notice that read doesn't see prior write!
  * thundering herd
    after a write invalidates a popular key, many FEs miss at once
      and all fetch from the DB
  cause of these problems:
    no entity is aware of -- and ordering -- the concurrent
      operations on each key
  these problems motivated a new design (TAO),
    as well as the desire to systematically monitor correctness

FB's new system: TAO (Figure 1)
  FEs send reads/writes *through* caches -- not look-aside
  2nd layer of "root" caches
    one root cache per shard per region
    root cache orders accesses to each key -- one op at a time
  read:
    "leaf" cache for key's shard in FE's cluster checks
    on miss, forwards to root cache for the key's shard
    root cache also checks, forwards read to DB on miss
  write:
    leaf, root, root in master region, DB in master region, and back
    leaf/root caches update on the way back
  async:
    master DB replicates log to slave DBs in other regions
    DBs send invalidates
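
a minimal single-region Python sketch of this write-through arrangement,
  to make the data flow concrete
  (DB, RootCache, LeafCache and their methods are invented names, not TAO's API;
   sharding, the remote master/slave regions, and async invalidates are omitted):

  # Hypothetical sketch, not FB's code: one leaf cache in front of one root
  # cache in front of one DB.  The root serializes operations on each key
  # and fills the caches on the way back -- the key difference from
  # look-aside caching.

  import threading

  class DB:
      def __init__(self):
          self.rows = {}
      def get(self, k):
          return self.rows.get(k)
      def put(self, k, v):
          self.rows[k] = v

  class RootCache:
      def __init__(self, db):
          self.db = db
          self.data = {}
          self.key_locks = {}
          self.mu = threading.Lock()

      def _lock(self, k):
          # one lock per key: the root handles one op on a key at a time
          with self.mu:
              return self.key_locks.setdefault(k, threading.Lock())

      def read(self, k):
          with self._lock(k):
              if k not in self.data:
                  self.data[k] = self.db.get(k)   # miss: forward to the DB
              return self.data[k]

      def write(self, k, v):
          with self._lock(k):                     # no read can slip in mid-write
              self.db.put(k, v)                   # DB in the master region
              self.data[k] = v                    # update cache on the way back

  class LeafCache:
      def __init__(self, root):
          self.root = root
          self.data = {}

      def read(self, k):
          if k not in self.data:                  # check leaf; on miss ask root
              self.data[k] = self.root.read(k)
          return self.data[k]

      def write(self, k, v):
          self.root.write(k, v)                   # leaf -> root -> DB, and back
          self.data[k] = v                        # leaf updated before the FE's write returns

the per-key lock stands in for the root cache's one-op-at-a-time
  ordering of accesses to each key
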
how does TAO fix look-aside problems?
  * miss/write race:
    root cache processes one op at a time
    delays write until read completely done
  * read-after-write anomaly:
    caches install new value (from master DB) before returning from write
    FE won't issue subsequent read until write finishes
  * thundering herd:
    each cache is aware of all requests for a key, and
      can limit to one outstanding miss per key, and
      can rate-limit all misses across keys

TAO consistency model
  paper says:
    per-object sequential consistency within a cache
    read-after-write consistency within a cache
    eventual consistency between caches
  why per-object sequential consistency?
    close to PNUTS timeline consistency
    allows cached values to lag (e.g. due to slow invalidates)
    but ensures users don't see values go *backwards*
      so viewers don't see information appear, then disappear
    also part of eventual consistency
      ensures agreement about which write comes last
  why read-after-write?
    really "see your own writes"
    violations would be plainly visible to users
  what is "within a cache"?
    guaranteed only if FE uses a single cluster
    if cluster fails and FE switches, no guarantee
  this is a weak consistency model
    allows stale reads from caches (until invalidate arrives)

what would a linearizable design look like?
  single copy of each key, writes+reads all go to master DB?
    like PNUTS read-latest
  caching, but write must *wait* for all invalidates to complete?
  what would the cost be?
    either much slower reads, or much slower writes
    linearizability costs you on every operation
    but may only make a difference in smallish window after each write

why do they want a consistency checker?
  they *want* linearizability but would prefer not to pay for it
  so they want to know how far short TAO falls in practice
  so they check against linearizability, as well as against actual guarantees
  they also want a systematic way to watch for the kind of bugs
    that the look-aside cache had

what is linearizability?
  there exists a total order of all operations,
    that matches real-time (for non-overlapping ops),
    and in which each read sees most recent preceding write
  example:
    |--Wx1--|     |-Wx2-|
      |------Rx2-------|
                    |-Rx1-|
    order: Wx1 Rx1 Wx2 Rx2
      obeys real-time rule -- only constrains Wx1->Wx2 and Wx1->Rx1
      obeys value rule

checking linearizability is tricky
  if writes and reads overlap
    in above example, either read could have returned either value
    multiple outcomes are often legal if concurrency
  but a strictly subsequent read *must* return 2
    since Wx2 strictly later than Wx1
  difficult to check concurrent ops even in controlled tests, like Lab 3
  they want to check using passive observations!

linearizability is a "local model"
  can spot violations by looking at traces of individual keys
  no need to consider multi-key traces, or per-FE traces

how do they check linearizability?
  pick a (tiny) random subset of objects (keys)
  all FEs record all reads/writes on those keys
  offline processing of trace at end of day
  what's in a trace entry?
    key, read vs write, value hash, start time, end time, &c
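
a hypothetical Python sketch of that FE-side trace collection
  (function and field names are invented; FB's actual sampling and
   logging mechanics surely differ):

  # Hypothetical sketch of FE-side trace collection (invented names).
  # Only a tiny deterministic sample of keys is traced, so every FE
  # traces the same keys; ops are logged with start/end times from the
  # FE's local clock and a hash of the value rather than the value itself.

  import hashlib
  import time
  from dataclasses import dataclass

  def sampled(key):
      # deterministic: roughly one key in a million, same choice at every FE
      h = int(hashlib.md5(key.encode()).hexdigest(), 16)
      return h % 1_000_000 == 0

  @dataclass
  class TraceEntry:
      key: str
      op: str          # "read" or "write"
      value_hash: str
      start: float     # FE clock when the op was issued
      end: float       # FE clock when the op returned

  trace_log = []       # shipped off for offline checking at end of day

  def value_hash(v):
      return hashlib.md5(repr(v).encode()).hexdigest()

  def traced_read(cache, key):
      if not sampled(key):
          return cache.read(key)
      start = time.time()
      v = cache.read(key)
      trace_log.append(TraceEntry(key, "read", value_hash(v), start, time.time()))
      return v

  def traced_write(cache, key, v):
      if not sampled(key):
          cache.write(key, v)
          return
      start = time.time()
      cache.write(key, v)
      trace_log.append(TraceEntry(key, "write", value_hash(v), start, time.time()))

note the timestamps come from each FE's own clock --
  clock error is what lets the checker occasionally mis-classify
  an execution, as discussed below
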
Figure 4:  |--Wx1--|  |--Wx2--|  |--Rx1--|

they construct a graph for each key
  each vertex is a read or write operation
    Figure 4: Wx1 -> Wx2 -> Rx1 and Wx1 -> Rx1
  edge for each time relationship
    if op2's start time > op1's end time, edge from op1 to op2
  edge for each value relationship
    if op2 read op1's written value, edge from op1 to op2
  the edges are constraints on any total order
    they reflect the two linearizability rules
  loop -> no total order -> not linearizable
  no loop -> can find a total order -> linearizable

they merge each read into the write whose value it read
  moving the read's edges to the write
  Figure 4:  Wx1 --> Wx2
                 <--
  (I do not fully understand the merging story.)
  why is merging legal?
    if the trace is linearizable, all of a write's reads must happen
      before (in time) the next write, or be concurrent,
      so the merge can't cause loops
  why is merging needed?
    i think non-merged graph has loops only when a read precedes
      the corresponding write
    we want a graph with loops when a read doesn't see preceding write
    perhaps merge converts to write/write edges:
      an edge to each write seen by a read later in time
      if no anomalies, these edges will all point forwards (no loops)
      if anomalies, may point sideways (Figure 5) or backwards (Figure 4)

Figure 4 has a cycle:
  Wx1 precedes Wx2 in time
  Wx1 follows Wx2 in value visibility
  cycle -> no total order -> not linearizable
  "stale read anomaly" -- Rx1 probably from stale cache
  though this analysis doesn't do much to tell you what the underlying bug is

Example: Figure 5
  "total order anomaly"
  when concurrent writes, either outcome is OK,
    but subsequent reads have to agree!

will this technique ever say "legal" to a non-linearizable execution?
  yes, if incorrect clock says a stale read happened earlier than it did
    Wx1 Wx2 Rx1
  maybe, if the clock mis-orders Wx2 Wx1:
    Wx2 Rx2 but it saw stale earlier write (not really an error)
  (and of course it says nothing about non-sampled keys)

will this technique ever say "illegal" to a linearizable execution?
  yes, if FE clocks disagree
    for write/write or write/read or read/write order

they also check per-obj seq consist and read-after-write
  any violation of weaker models is also a violation of linearizability
  so they just have to filter the linearizability anomalies
  but I do not understand how they do it!

why can't they check causal consistency? (this is The Question)
  this is legal under causal consistency:
    C0: Wx1
    C1: Rx1
    C2: Rx0
  this is illegal:
    C0: Wx1
    C1: Rx1 Wy2
    C2: Ry2 Rx0
  that is, if C2 sees write to y, it should also see all writes
    that could have caused that write to y
  can we distinguish these using just an x trace, or just a y trace?
    the y trace alone looks OK in the illegal trace: Wy2 Ry2
      so the y trace alone isn't enough to spot illegal executions
    the x trace is the SAME for both the legal and illegal traces: Wx1 Rx1 Rx0
      so the x trace alone isn't enough to spot illegal executions
  since they only trace a tiny random sample of keys,
    they are unlikely to see both x and y,
    so they cannot check causal consistency

why is linearizability a local model?
  i.e. why can we check it with just a small sample of keys?
  let's look at a classic multi-key situation
    initialization: x=0 y=0
    T0:           T1:
    y=1           x=1
    if x==0:      if y==0:
      print "y"     print "x"
    we had better not see both prints!
  an execution that's correct under linearizability:
              Wy1 Rx1
    Wx0 Wy0
              Wx1 Ry1
    trace of just x: Wx0 Wx1 Rx1
    checker sees no anomalies
  an execution that's incorrect under linearizability:
              Wy1 Rx0
    Wx0 Wy0
              Wx1 Ry0
    trace of just x: Wx0 Wx1 Rx0
    draw "T" and "V" arrows for x
    checker sees a stale read anomaly
  what if Rx0 started before Wx1 ended?
    then x trace is legal
    but in that situation real-time requires Ry0 to be after Wy1,
      so y's trace will show a stale read anomaly
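
a rough Python sketch of the per-key check in the merged form
  described above, run on the x traces from this example
  (Op and check_key are invented names; it assumes each traced write
   wrote a distinct value; the paper's checker works on value hashes
   and handles more cases):

  # Sketch (my reconstruction, not the paper's code) of the per-key check.
  # After merging each read into the write whose value it read, an edge
  # w1 -> w2 means w1 must precede w2 in any total order; a cycle means
  # no total order exists, i.e. a linearizability anomaly.

  from dataclasses import dataclass

  @dataclass
  class Op:
      kind: str     # "read" or "write"
      value: str    # value hash in the real trace
      start: float  # from the FE's clock
      end: float

  def check_key(ops):
      writes = [o for o in ops if o.kind == "write"]
      by_value = {w.value: w for w in writes}   # assumes distinct written values

      # merging: a read is represented by the write whose value it returned
      def home(o):
          return o if o.kind == "write" else by_value[o.value]

      # time edges: if o1 ended before o2 started, home(o1) must precede home(o2)
      edges = {id(w): set() for w in writes}
      for o1 in ops:
          for o2 in ops:
              if o1.end < o2.start and home(o1) is not home(o2):
                  edges[id(home(o1))].add(id(home(o2)))

      # depth-first search for a cycle
      state = {id(w): 0 for w in writes}        # 0 unvisited, 1 in progress, 2 done
      def has_cycle(u):
          state[u] = 1
          for v in edges[u]:
              if state[v] == 1 or (state[v] == 0 and has_cycle(v)):
                  return True
          state[u] = 2
          return False

      return any(has_cycle(id(w)) for w in writes if state[id(w)] == 0)

  # x traces from the two executions above (times made up, same ordering)
  ok  = [Op("write", "0", 0, 1), Op("write", "1", 2, 3), Op("read", "1", 4, 5)]
  bad = [Op("write", "0", 0, 1), Op("write", "1", 2, 3), Op("read", "0", 4, 5)]
  print(check_key(ok))    # False: a total order exists, no anomaly
  print(check_key(bad))   # True: Wx0 <-> Wx1 cycle, a stale read anomaly

merging turns the time edge from Wx1 to the late Rx0 into a backwards
  Wx1 -> Wx0 edge -- the same kind of cycle as Figure 4
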
measurement results -- Table 3
  I'd expect many linearizability anomalies, from stale caches,
    since TAO implements a relaxed consistency model
  the big surprise is that there are so few!
    only 0.0004% of reads show linearizability anomalies
  why so few?
    few writes (1 in 450)
    locality (consistency better within writer's cache)
    fast invalidation messages

what do the results mean?
  does 0.0004% suggest no need to implement linearizability?
    since in practice it would rarely make any difference?
  is 0.0004% good or bad?
    for comments on pictures?
    for the latest bid in an eBay auction?
    for a password change?
  what about:
    counting ad impressions for paying ad customers?
    bank transfers?
    this paper is *not* about transactions!

how do they think about consistency?
  not "correct vs incorrect"
  not a contract with programmer that must be fulfilled
  seems to be nothing special for them about any given model
    i.e. not "our system works if storage is per-obj seq consistent,
      but breaks if it isn't"
  their real goal is closer to freshness
    they want to minimize "vulnerability window" in which reads
      miss the latest write
    in order to minimize user complaints
  still, users can tolerate stale data, often won't notice

what about ease of programming?
  is there anything here that would make programming easier?
  application code has to defend against anomalies
    e.g. comment appears before commented-on picture

how generalizable are their results?
  to other storage systems or applications?
  most relevant to user-oriented systems with no hard requirements
  0.0004% depends deeply on read-heavy, locality, and fast invalidation

conclusion
  low-cost checking technique
  low-consistency design can provide remarkably high consistency!
    i.e. they rarely display stale data to the user
    sheds doubt on need for costly consistent designs
  consistency not always a correctness question
    for them, about freshness of information shown to user
    *not* about correct/incorrect or inter-module contracts
  results are Facebook-specific
    but perhaps they are representative of many web sites