6.824 2017 Lecture 9: Distributed Transactions Topics: distributed transactions = distributed commit + concurrency control two-phase commit What's a transaction? suppose an application wants to make complex updates to a DB how to cope with crashes and concurrency? there are many ways to apprach this problem transactions are the gold standard of convenience, though expensive example -- new account creation: begin_transaction() if "alice" not in password table: add alice to password table add alice to profile table commit_transaction() transactions must be: atomic: all writes occur, or none, even if failures serializable: final result is as if transactions executed one by one durable: committed writes survive crash and restart these are the "ACID" properties applications rely on these properties! we are interested in *distributed* transactions Distributed commit: A bunch of computers are cooperating on some task Each computer has a different role Want to ensure atomicity: all execute, or none execute Challenges: failures, performance Example: calendar system, each user has a calendar want to schedule meetings with multiple participants one server holds calendars of users A-M, another server holds N-Z [diagram: client, two servers] sched(u1, u2, t): begin_transaction() ok1 = reserve(u1, t) ok2 = reserve(u2, t) if ok1 and ok2: if commit_transaction(): print "yes" else abort_transaction() the reserve() calls are RPCs to the two calendar servers We want atomicity: both reserve, or neither reserves. What if 1st reserve() returns true, and then: 2nd reserve() returns false (time not available, or u2 doesn't exist) 2nd reserve() doesn't return (lost RPC msg, or u2's server crashes) client fails before 2nd reserve() We need a "distributed commit protocol" Idea: tentative changes, later commit or undo (abort) reserve_handler(u, t): if u[t] is free: temp_u[t] = taken -- A TEMPORARY VERSION return true else: return false commit_handler(): copy temp_u[t] to real u[t] abort_handler(): discard temp_u[t] Idea: single entity decides whether to commit to ensure agreement let's call it the Transaction Coordinator (TC) [time diagram: client, TC, A, B] client sends RPCs to A, B client's commit_transaction() sends "go" to TC TC/A/B execute distributed commit protocol... TC reports "commit" or "abort" to client We want two properties for distributed commit protocol: Correctness: if any commit, none abort if any abort, none commit Performance: (since doing nothing is correct...) if no failures, and all can commit, then commit. if failures, come to some conclusion ASAP. We're going to develop a protocol called "two-phase commit" Used by distributed databases for multi-server transactions A common pattern in many distributed systems Two-phase commit without failures: [time diagram: client, TC, A, B] client sends reserve() RPCs to A, B client sends "go" to TC TC sends "prepare" messages to A and B. A and B respond, saying whether they're willing to commit. Respond "yes" if haven't crashed, timed out, &c. If both say "yes", TC sends "commit" messages. If either says "no", TC sends "abort" messages. A/B commit if they get a commit message. I.e. they write temp_* to the real DB. Why is this correct so far? Neither can commit unless they both agreed. Crucial that neither changes mind after responding to prepare Not even if failure! What about failures? Network broken/lossy/slow Server crashes What is our goal w.r.t. failure? Resume correct operation after repair I.e. recovery, *not* availability (since no replication here) Single symptom: timeout when expecting a message. Where do hosts wait for messages? 1) TC waits for yes/no. 2) A and B wait for prepare and commit/abort. Termination protocol summary: TC t/o for yes/no -> abort B t/o for prepare, -> abort B t/o for commit/abort, B voted no -> abort B t/o for commit/abort, B voted yes -> block TC timeout while waiting for yes/no from A/B. TC has not sent any "commit" messages. So TC can safely abort, and send "abort" messages. A/B timeout while waiting for prepare from TC have not yet responded to prepare, so TC can't have decided commit so A/B can unilaterally abort respond "no" to future prepare A/B timeout while waiting for commit/abort from TC. Let's talk about just B (A is symmetric). If B voted "no", it can unilaterally abort. So what if B voted "yes"? Can B unilaterally decide to abort? No! TC might have gotten "yes" from both, and sent out "commit" to A, but crashed before sending to B. So then A would commit and B would abort: incorrect. B can't unilaterally commit, either: A might have voted "no". So: if B voted "yes", it must "block": wait for TC decision. What if B crashes and restarts? If B sent "yes" before crash, B must remember! Can't change to "no" (and thus abort) after restart Since TC may have seen previous yes and told A to commit Thus subordinates must write persistent (on-disk) state: B must remember on disk before saying "yes", including modified data. If B reboots, disk says "yes" but no "commit", B must ask TC. (this is The Question) If TC says "commit", B copies modified data to real data. What if TC crashes and restarts? If TC might have sent "commit" or "abort" before crash, TC must remember! And repeat that if anyone asks (i.e. if A/B/client didn't get msg). Thus TC must write "commit" to disk before sending commit msgs. TC can't change its mind since A/B/client may have already acted. This protocol is "two-phase commit". * All hosts that decide reach the same decision. * No commit unless everyone says "yes". * TC failure can make servers block until repair. What about concurrent transactions? x and y are bank balances x and y start out as $10 T1 is doing a transfer of $1 from x to y T1: add(x, 1) -- server A add(y, -1) -- server B T2: tmp1 = get(x) tmp2 = get(y) print tmp1, tmp2 Problem: what if T2 runs between the two add() RPCs? then T2 will print 11, 10 money will have been created! T2 should print 10,10 or 9,11 The traditional correctness definition is "serializability" results should be as if transactions ran one at a time in some order as if T1, then T2; or T2, then T1 the results for the two differ; either is OK You can test whether a specific execution is serializable by looking for a serial order that yields the same results. there's no such order for 11,10, but there is for 10,10 and 9,11 Why is serializability good for programmers? it allows application code to ignore concurrency just write the transaction to take system from one legal state to another internally, the transaction can temporarily violate invariants e.g. midway through T1 but serializability guarantees other xactions won't notice Why is serializability OK for performance? transactions that don't conflict can run in parallel since, if T3 and T4 don't conflict, *results* from T3 || T4 will be the same as T3, then T4 (and T4, then T3) How to decide if it's OK to let two transactions run in parallel? "Two-phase locking" is one way to implement serializability each database record has a lock the lock is stored at the server that stores the record transaction must wait for and acquire a record's lock before using it thus add() handler implicitly acquires lock when it uses record x or y transaction holds its locks until *after* commit or abort Why hold locks until after commit/abort? why not release as soon as done with the record? e.g. why not have T2 release x's lock after first get()? T1 could then execute between T2's get()s T2 would print 10,9 but that is not a serializable execution: neither T1;T2 nor T2;T1 What are locks really doing? When transactions conflict, locks delay one to force serial execution. When transactions don't conflict, locks allow fast parallel execution. How does locking interact with two-phase commit? Server must aquire and remember locks as it executes client requests. So client->server RPCS have two effects: acquire lock, use data. If server says "yes" to TC's prepare: Must remember locks and values across crash+restart! So must write locks+values to disk (in log), before replying "yes". If reboot, then COMMIT from TC, read locks+values from disk. If server has not said "yes" to a prepare: If crash+restart, server can release locks and discard new values. (or just forget about them during the crash) And then say "no" to TC's prepare message. Today's paper: two-phase commit in R* R and R* were hugely influential early IBM DB research projects perhaps the first paper to explain details of two-phase commit for us, two topics: how to forget transactions speeding up read-only transactions Forgetting in R*'s 2P protocol my description of 2PC assumes subordinates can ask TC at any time and that the TC remembers outcome of all transactions this is not practical -- servers must forget old transactions problem: what if subordinate sends a query about a forgotten xaction? solution in 2P: each subordinate replies with ACK to TC's COMMIT msg each subordinate can then forget TC waits for ACKs from all subordinates TC can then forget TC receives a query from subordinate about unknown transaction two cases can lead to TC not knowing abt a transaction: 1. TC received ACKs from all subordinates, and then forgot 2. TC crashed after sending prepares, before writing commit if #1, subordinate logged a COMMIT, so this can't happen if #2, TC could not have committed, so TC can reply ABORT ACK == subordinate promises not to ask about this xaction subordinate receives a COMMIT from TC abt unknown transaction (perhaps network lost its first ACK) subordinate must have prepared and committed, otherwise would still be merely prepared so subordinate can reply with ACK Speeding up read-only transactions with R*'s PA (Presumed Abort) PA eliminates a round of messages for read-only transactions (this is one of many special cases you could streamline) I'll discuss completely read-only transactions they need to lock, and hold locks until they finish, to ensure serializability for r/o versus r/w transactions so TC needs a prepare phase to ask each if it is still alive and holding locks at the end of the transaction PA does this for read-only transactions: --prepare--> <--VOTE READ-- or <--VOTE NO-- no second phase is needed! subordinates don't need to write anything subordinates can safely release locks on prepare since all reads have completed no log writes are needed! TC and subordinates can forget immediately! the reason: at all points it is safe to abort so neither TC nor subordinates ever need to send a query so TC and subordinates don't have to carefully preserve information across crash+reboot 2PC perspective Used in sharded DBs when a transaction uses data on multiple shards But it has a bad reputation: slow: multiple rounds of messages slow: disk writes locks are held over the prepare/commit exchanges; blocks other xactions TC crash can cause indefinite blocking, with locks held Thus usually used only in a single small domain E.g. not between banks, not between airlines, not over wide area Faster distributed transactions are an active research area: Lower message and persistence cost Special cases that can be handled with less work Wide-area transactions Less consistency, more burden on applications Raft and two-phase commit solve different problems! Use Raft to get high availability by replicating i.e. to be able to operate when some servers are crashed the servers all do the *same* thing Use 2PC when each subordinate does something different And *all* of them must do their part 2PC does not help availability since all servers must be up to get anything done Raft does not ensure that all servers do something since only a majority have to be alive What if you want high availability *and* distributed commit? [diagram] Each "server" should be a Raft-replicated service And the TC should be Raft-replicated Run two-phase commit among the replicated services Then you can tolerate failures and still make progress You'll build something like this to transfer shards in Lab 4