6.824 2020 Final Exam *** Transactions and Serializability This is a 120-minute exam. Please write down any assumptions you make. You are allowed to consult 6.824 papers, notes, and lab code. If you have questions, please ask them on Piazza in a private post to the staff. Please don't discuss the exam with anyone. Alyssa has built a storage system that provides its clients with transactions. Alyssa won't tell us how her system works, but she does guarantee that the transactions are serializable. Following Section 9.1.6 of the reading for Lecture 12, serializable means that, if you run a bunch of concurrent transactions and they produce some results (output and updated database records), there exists some serial order of those transactions that would, if followed, yield the same results. You start three concurrent transactions at about the same time: T1: begin() put(y, 2) end() T2: begin() put(x, 99) put(y, 99) put(z, 99) end() T3: begin() tmpx = get(x) tmpy = get(y) tmpz = get(z) print tmpx, tmpy, tmpz end() x, y, and z are identifiers of different database records. All three start out with value 0. put() and get() write and read database records, and begin() and end() indicate the start and finish of a transaction. There is no activity in the system other than these three transactions, and no failures occur. If T3 prints 99, 2, 99, is that a serializable result? Explain why or why not. Answer: Yes, it is a serializable result. The serial order that yields that result is T2, T1, T3. --- If T3 prints 0, 2, 99, is that a serializable result? Explain why or why not. Answer: No, it is not a serializable result. One way to demonstrate this is to try all six possible serial orders, and observe that none of them yield this result. Another is to observe that T3 saw x=0, so T3 must appear before T2 in any equivalent serial order; and T3 saw z=99, so T3 must appear after T2; but both cannot be true. *** Two-Phase Commit Recall the Two-Phase Commit protocol for distributed transactions from Lecture 12 and Section 9.6.3 of the reading for Lecture 12. A common criticism of two-phase commit is that it causes workers to hold locks for a long time; for example, in Figure 9.37 (page 9-89), each worker must hold its locks while waiting to receive the COMMIT message. What would go wrong if each worker released its locks after replying to the PREPARE message? Answer: Suppose some worker W1 is part of transaction T1. When W1 receives a PREPARE message, it doesn't yet know whether T1 will commit or abort. If W1 releases its locks after the PREPARE, and lets other transactions see values written by the not-yet-committed transaction T1, and T1 aborts, then those other transactions will use values that never existed. If W1 releases its locks but doesn't update values until T1 commits, then another transaction may aquire T1's locks and read some values as they existed before T1, and some as updated by T1, which is not generally serializable. *** Spanner Refer to the paper "Spanner: Google's Globally-Distributed Database," by Corbett et al. The start of Section 4.1.2 says that all of a read/write transactions's writes are assigned the same timestamp (the timestamp assigned to the Paxos write that commits the transaction). Suppose, instead, that each write is assigned a timestamp equal to TT.now().latest at the time when the client transaction code calls the Spanner library function to modify the data. What aspect of Spanner would this break, and why? Answer: Read-only transactions would no longer provide serializability. Spanner executes read-only transactions as of a particular time-stamp. In Spanner as described by the paper, each read-write transaction's writes all occur at the same time-stamp, and thus are either all before or all after each read-only transaction. So time-stamp order determines the equivalent serial order. But if a read-write transaction's writes have different time-stamps, then a read-only transaction could have a time-stamp that's in the middle of the read-write transaction's writes, and thus see some but not all of the read-write transation's writes. In that case there will not generally be any equivalent serial order. *** FaRM Ben is using FaRM (see the paper "No compromises: distributed transactions with consistency, availability, and performance" by Dragojevic et al). Here is Ben's transaction code: begin() if x > y: y = y + 1 else: x = x + 1 end() x and y are objects stored in FaRM. begin() and end() mark the start and end of the transaction. Ben starts two instances of this transaction at exactly the same time on two different machines. The two transactions read x and y at exactly the same time, and send LOCK messages at exactly the same time. There is no other activity in the system, and no failures occur. x and y both start out with value zero. Explain what the outcome will be, and how the FaRM commit protocol (Figure 4) arrives at that outcome. Answer: The outcome is that FaRM will tell one transaction that it has committed, and the other that it has aborted. The resulting values will be x=1 and y=0. FaRM arrives at this outcome because the server for x will process the two LOCK messages in one order or the other, and grant the lock on x to only the first LOCK; the server will respond with a failure message to the second LOCK. (The scenario in the previous question would work despite this change: while both transactions' VALIDATE(x) would succeed, one of the transactions' LOCK(y) would fail, either due to y already being locked, or due to y's version number having changed.) --- Ben thinks that some transactions might execute faster if he modified FaRM to run Figure 4's VALIDATE phase before the LOCK phase. It turns out that this change can cause incorrect results in some situations. Outline such a situation. Your answer can involve transactions other than the one shown above. Answer: Consider these two transactions: T1: if x == 0: y = 1 T2: if y == 0: x = 1 x and y start as zero. If T1 and T2 run at exactly the same time, and send VALIDATE first, then both VALIDATEs will succeed. Both LOCKs will succeed as well, since T1 and T2 lock different objects. So both transactions will commit, leaving x=1 and y=1. But that's not the result obtained by either serial order: T1 then T2 yields x=0 y=1, and T2 then T1 yields x=1 y=0. So Ben's change will break serializability in some situations. *** Spark Ben is using Spark to compute PageRank on a huge database of web links. He's using PageRank code much like that shown in Section 3.2.2 of "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," by Zaharia et al. Ben's PageRank job takes hours to run, and he wants to find out which parts of the job take the most time. He inserts code that gets the wall-clock time before and after each line of the PageRank shown in Section 3.2.2, and prints the difference (i.e. prints how long that line took to execute). He sees that each line in the "for" loop in Section 3.2.2 takes only a fraction of a second to execute, and that the whole loop itself takes less than a second, even though the entire Spark job takes hours. Explain why the for loop takes so much less time than the whole job. Answer: The for loop contains only transformations, not actions. Spark doesn't start the computation until a call to an action. Calls to transformations merely build up a description of a lineage graph, which takes very little time. --- Ben thinks that telling Spark to cache the "ranks" RDD may make his PageRank job complete in less time, since ranks is used in each iteration of the "for" loop. He adds a call to "ranks.persist()" after each of the two assignments to ranks. However, he finds that his PageRank job takes the same amount of time to complete despite this change. Explain. Answer: Each iteration of the for loop generates a distinct new ranks RDD. Thus each ranks RDD is only used once. Thus there's no advantage in caching it. *** Memcache / Facebook Ben runs a small web site. He has a bunch of web servers, a single memcached server, and a single MySQL DB. The web servers store data persistently in the MySQL DB, and cache database records in memcached. Thus Ben's web servers act as clients to the memcached and MySQL servers. Though Ben has read "Scaling Memcache at Facebook," by Nishtala et al, he ignores all the ideas in that paper. For example, Ben's setup has nothing like Figure 6's invalidates or McSqueal, and no leases. Ben has programmed his web servers to use memcached and MySQL as follows. When a web server needs a data item, it first tries to read it from memcached; if that read misses, the web server reads the data item from MySQL. However, his web servers do not insert data read from MySQL into memcached. Instead, when one of Ben's web servers writes data, it always inserts (or updates) the data in both memcached and MySQL. Here's pseudo-code describing how Ben's web server's use memcached and MySQL: read(k): v = get(k) // is the key/value in memcached? if v is nil { // miss -- not in memcached v = fetch from DB } return v write(k,v): send k,v to DB set(k,v) // insert the new key/value into memcached Ben reasons that, since every write updates both MySQL and memcached, the two will always agree except during the small window of time between the two lines of code in write(). Explain why Ben is wrong. Answer: Suppose clients C1 and C2 both write different values to key K at about the same time. If C1's write arrives at the DB before C2's write, but C1's write arrives at memcached *after* C2's write, then memcached will disagree with the DB. This disagreement will last until the next time a client writes K, which may be a long time. *** COPS You're using COPS (see "Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS," by Lloyd et al.). You have four data centers, and clients and a COPS Data Store Cluster at each data center. Four clients, each at a different data center, each start executing code that uses COPS at about the same time. Here's the code each client executes: Client C1: put(x, 1) put(y, 2) Client C2: tmp_y = get(y) put(z, 100 + tmp_y) Client C3: tmp_x = get(x) tmp_y = get(y) tmp_z = get(z) print "x=", tmp_x print "y=", tmp_y print "z=", tmp_z Client C4: tmp_z = get(z) tmp_y = get(y) tmp_x = get(x) print "x=", tmp_x print "y=", tmp_y print "z=", tmp_z Note that C3 and C4 read x, y, and z in opposite orders. All values start as zero. The clients start executing these code sequences at about the same time. The system uses COPS, not COPS-GT (that is, there are no transactions). There is no other activity in the system, just these four clients. There are no failures. Which of the following are possible outputs from C3? Mark all that apply. [ ] x=1 y=2 z=102 [ ] x=1 y=0 z=102 [ ] x=0 y=2 z=102 [ ] x=1 y=2 z=0 [ ] x=1 y=0 z=0 [ ] x=0 y=2 z=0 Answer: All of the above are possible. --- Which are possible outputs from C4? Mark all that apply. [ ] x=1 y=2 z=102 [ ] x=1 y=0 z=102 [ ] x=0 y=2 z=102 [ ] x=1 y=2 z=0 [ ] x=1 y=0 z=0 [ ] x=0 y=2 z=0 Answer: x=1 y=2 z=102 x=1 y=2 z=0 x=1 y=0 z=0 *** Lab 3 Explain what would go wrong if clients sent Lab 3 Get() RPCs to followers, and followers responded with the values in their replica of the data (without talking to the leader, and without adding anything to the log). Answer: Clients could observe stale reads if they contacted a follower that hadn't seen the latest Put()/Append(). ---- Alyssa wonders whether one could use Spanner's TrueTime (the API in Table 1 of the Spanner paper) to help design a correct version of Lab 3 in which followers can respond to client Get() RPCs. Outline a design for this. You can assume that both clients and servers have TrueTime, and that the key/value service is continuously busy with both Put() and Get() operations. Answer: Have clients supply their local timestamp `req_ts = TT.now()` along with their Get() requests. On the KV server side, for all operations that go through the log (i.e. Put() and Append()), have the leader include a local timestamp `op_ts = TT.now()` in the Command put into the Raft log. To serve a Get() (on a leader or follower) with a timestamp `req_ts`, wait until the local state machine has processed an entry from the Raft log with timestamp `op_ts` such that `op_ts.earliest > req_ts.latest`, and then serve the Get() by reading from the local key/value map. *** 6.824 Which papers should we omit in future 6.824 years, because they are not useful or are too hard to understand? [ ] 6.033 two-phase commit XXXXXX [ ] Spanner [ ] FaRM XXXXXX [ ] Spark XX [ ] Memcached at Facebook XX [ ] COPS XXXXXXXXXXX [ ] Certificate Transparency XXXXXXXXXXXXX [ ] Bitcoin X [ ] BlockStack XXXXXXXXXXXXXXXXXXX [ ] AnalogicFS XXXXXXXXXXXXXX --- Which papers did you find most useful? [ ] 6.033 two-phase commit XXXXXXXXXXXXXXXXXX [ ] Spanner XXXXXXXXXXXXXXXXXXXXXXXXXXXXX [ ] FaRM XXXXXXXXXXXXXXXXXXX [ ] Spark XXXXXXXXXXXXXXXXXXXXXXXX [ ] Memcached at Facebook XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX [ ] COPS XXXXXXXXXXXX [ ] Certificate Transparency XXXXXXXXXXXX [ ] Bitcoin XXXXXXXXXXXXXXXXX [ ] BlockStack XXXX [ ] AnalogicFS XXXX