6.824 2017 Lecture 10: FaRM

why are we reading this paper?
  many people want distributed transactions
  but they are thought to be slow
  this paper suggests that needn't be true -- very surprising performance!

big performance picture
  90 million *replicated* *persistent* *transactions* per second (Figure 7)
  1 million transactions/second per machine
    each with a few messages, for replication and commit
  very impressive
    a few other systems get 1 million ops/second per machine, e.g. memcached
    but not transactions + replicated + persistent (often not any of these!)
  perspective on 90 million:
    10,000 Tweets per second
    2,000,000 e-mails per second

how do they get high performance?
  data must fit in total RAM (so no disk reads)
  non-volatile RAM (so no disk writes)
  one-sided RDMA (fast cross-network access to RAM)
  fast user-level access to NIC
  transaction+replication protocol that exploits one-sided RDMA

NVRAM
  FaRM writes go to RAM, not disk -- eliminates a huge bottleneck
    can write RAM in 200 ns, but takes 10 ms to write hard drive, 100 us for SSD
    ns = nanosecond, ms = millisecond, us = microsecond
  but RAM loses content in a power failure! not persistent by itself.
  why not just write to the RAM of f+1 machines, to tolerate f failures?
    might be enough if failures were always independent
    but power failure is not independent -- may strike 100% of machines!
  so: batteries in every rack, can run machines for a few minutes
    power h/w notifies s/w when main power fails
    s/w halts all transaction processing
    s/w writes FaRM's RAM to SSD; may take a few minutes
    then machine shuts down
    on re-start, FaRM reads saved memory image from SSD
    "non-volatile RAM"
  what if a crash prevents s/w from writing to SSD?
    e.g. bug in FaRM or kernel, or cpu/memory/hardware error
    FaRM copes with single-machine crashes by copying data
      from the RAM of the crashed machine's replicas to other machines
      to ensure there are always f+1 copies
    crashes (other than power failure) must be independent!

why is the network often a performance bottleneck?
  the usual setup:
    app                         app
    ---                         ---
    socket buffers              buffers
    TCP                         TCP
    NIC driver                  driver
    NIC  ---------------------  NIC
  lots of expensive CPU operations:
    system calls
    copy messages
    interrupts
    and all twice if RPC
  slow: hard to build an RPC system that can deliver more than a few 100,000 / second
  wire b/w (e.g. 10 gigabits/second) is rarely the limit for short RPC
  these per-packet CPU costs are the limiting factor for small messages

FaRM's network setup
  [hosts, 56 gbit NICs, expensive switch]
  NIC does "one-sided RDMA": memory read/write, not packet delivery
    sender says "write this data at this address", or "read this address"
    NIC *hardware* executes it at the far end
    returns a "hardware acknowledgement"
    no interrupt, kernel, copy, read(), &c at the far end
  one server's throughput: 10+ million/second (Figure 2)
  latency: 5 microseconds (from their NSDI 2014 paper)
  FaRM uses RDMA in three ways:
    one-sided read of objects during transaction execution (also VALIDATE)
    RPC composed of one-sided writes to primary's logs or message queues
    one-sided write into backup's log
  application access to NIC h/w is streamlined
    application directly interacts with NIC -- no system calls, no kernel
    sender gives NIC an RDMA command
    for RPC, receiver s/w polls the memory which RDMA writes (sketch below)

big challenge:
  how to use one-sided read/write for transactions and replication?
  protocols we've seen require the receiver's CPU to actively process messages
    e.g. Raft and two-phase-commit
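To make the "receiver s/w polls memory" idea above concrete, here is a minimal
sketch in Go (not FaRM's actual code, which is C++): the LogEntry layout, the
sizes, and all names are invented, and the "remote NIC" is simulated by an
ordinary local write.

  // Sketch: receiver-side polling of a RAM region that remote NICs
  // fill via one-sided RDMA writes.  Invented layout, for illustration.
  package main

  import "fmt"

  // One incoming log entry.  The sender writes Len last, so a non-zero
  // Len means the rest of the entry is complete and safe to read.
  type LogEntry struct {
      Len  int32    // 0 = empty slot
      Data [60]byte // payload (e.g. a LOCK or COMMIT record)
  }

  // incomingLog stands in for the region a remote NIC writes into.
  var incomingLog [1024]LogEntry

  // pollOnce scans from 'next', handling every completed entry it finds,
  // and returns the new scan position.  The point: no interrupts, no
  // system calls, no copies on the receive path -- just reading RAM.
  func pollOnce(next int, handle func([]byte)) int {
      for incomingLog[next].Len != 0 {
          e := &incomingLog[next]
          handle(e.Data[:e.Len])
          e.Len = 0 // free the slot for reuse
          next = (next + 1) % len(incomingLog)
      }
      return next
  }

  func main() {
      // Pretend a remote NIC just deposited a 5-byte record in slot 0.
      copy(incomingLog[0].Data[:], "LOCK!")
      incomingLog[0].Len = 5

      pollOnce(0, func(msg []byte) { fmt.Printf("received %q\n", msg) })
  }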
let's review distributed transactions
  remember this example: x and y are bank balances, maybe on different servers
    T1:          T2:
    add(x, 1)    tmp1 = get(x)
    add(y, -1)   tmp2 = get(y)
                 print tmp1, tmp2
  x and y start at $10
  we want serializability:
    results should be as if transactions ran one at a time in some order
    only two orders are possible
      T1 then T2 yields 11, 9
      T2 then T1 yields 10, 10
    serializability allows no other result
  what if T1 runs entirely between T2's two get()s?
    would print 10,9 if the transaction protocol allowed it
    but it's not allowed!
  what if T2 runs entirely between T1's two add()s?
    would print 11,10 if the transaction protocol allowed it
    but it's not allowed!

two classes of concurrency control for transactions:
  pessimistic: wait for lock on first use of object; hold until commit/abort
    called two-phase locking
    conflicts cause delays
  optimistic: access objects without locking; commit "validates" to see if OK
    valid: do the writes
    invalid: abort
    called Optimistic Concurrency Control (OCC)

FaRM uses OCC
  the reason: OCC lets FaRM read using one-sided RDMA reads
    the server storing the object does not need to set a lock, due to OCC
  how does FaRM validate? we'll look at Figure 4 in a minute.

every FaRM server runs application transactions and stores objects
  the application code running a transaction also acts as its transaction coordinator (TC)

FaRM transaction API (simplified):
  txCreate()
  o = txRead(oid)    -- RDMA
  o.f += 1
  txWrite(oid, o)    -- purely local
  ok = txCommit()    -- Figure 4

what's in an oid?
  a region # and an address
  region # indexes a mapping to [ primary, backup1, ... ]
  target NIC can use the address directly to read or write RAM
    so the target CPU doesn't have to be involved

server memory layout
  regions, each an array of objects
  object layout
    header with version # and lock flag
  for each other server (written by RDMA, read by polling):
    incoming log
    incoming message queue
  all this in non-volatile RAM (i.e. written to SSD on power failure)

every region is replicated on one primary and f backups -- f+1 replicas
  [diagram of a few regions, primary/backup]
  only the primary serves reads; all f+1 see commits+writes
  replication yields availability if <= f failures
    i.e. available as long as one replica stays alive; better than Raft

txRead
  one-sided RDMA to fetch the object directly from the primary's memory -- fast!
  also fetches the object's version number, to detect concurrent writes

txWrite
  must be preceded by txRead
  just writes a local copy; no communication
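As a concrete illustration of this API, here is a small Go sketch of the
bank-balance transaction T1 from above.  It is only a sketch: FaRM's real API
is C++, and the Tx interface, the oid type, and the retry loop here are
invented for the example.

  // Hypothetical Go bindings for the simplified transaction API above.
  package farmsketch

  // Tx is an assumed per-transaction handle; the caller acts as the TC.
  type Tx interface {
      Read(oid uint64) ([]byte, error) // one-sided RDMA read from the primary
      Write(oid uint64, val []byte)    // purely local until commit
      Commit() bool                    // runs the Figure 4 protocol; false = abort
  }

  // transferT1 adds 1 to balance x and subtracts 1 from balance y,
  // retrying if OCC validation fails at commit time.
  func transferT1(begin func() Tx, xOid, yOid uint64) error {
      for {
          tx := begin()
          x, err := tx.Read(xOid)
          if err != nil {
              return err
          }
          y, err := tx.Read(yOid)
          if err != nil {
              return err
          }
          x[0] += 1 // pretend byte 0 holds the balance
          y[0] -= 1
          tx.Write(xOid, x)
          tx.Write(yOid, y)
          if tx.Commit() { // LOCK / VALIDATE / COMMIT, per Figure 4
              return nil
          }
          // validation failed (a conflicting writer got there first); retry
      }
  }

The retry loop reflects the OCC style: a conflict only shows up at commit
time as a failed validation, and the usual response is to re-run the
transaction.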
transaction execution / commit protocol w/o failure -- Figure 4
  let's consider the steps in Figure 4 one by one
  thinking about concurrency control for now (not replication)

LOCK (first message in commit protocol)
  TC sends to the primary of each written object
    TC uses RDMA to append to its log at each primary
  LOCK record contains oid, version # the xaction read, new value
  primary s/w polls its log, sees LOCK, validates, sends "yes" or "no" reply message
  note LOCK is both logged in the primary's NVRAM *and* an RPC exchange

what does the primary CPU do on receipt of LOCK? (for each object)
  if the object is locked, or its version != what the xaction read, reply "no"
  otherwise set the lock flag and return "yes"
  note: does *not* block if the object is already locked

TC waits for all LOCK reply messages
  if any "no", abort
    send ABORT to primaries so they can release locks
    return "no" from txCommit()

(let's ignore VALIDATE and COMMIT-BACKUP for now)

TC sends COMMIT-PRIMARY to the primary of each written object
  uses RDMA to append to the primary's log
  TC only waits for the hardware ack -- does not wait for the primary to process the log entry
  TC returns "yes" from txCommit()

what does the primary do when it processes the COMMIT-PRIMARY in its log?
  copy the new value over the object's memory
  increment the object's version #
  clear the object's lock flag

example: T1 and T2 both want to increment x
  both say
    tmp = txRead(x)
    tmp += 1
    txWrite(x, tmp)
    ok = txCommit()
  x should end up with 0, 1, or 2,
    consistent with how many of them successfully committed

what if T1 and T2 are exactly in step?
    T1: Rx0  Lx  Cx
    T2: Rx0  Lx  Cx
  what will happen?
or
    T1: Rx0  Lx  Cx
    T2:   Rx0  Lx  Cx
or
    T1: Rx0  Lx  Cx
    T2:              Rx0  Lx  Cx

intuition for why validation checks serializability:
  i.e. checks "was execution one at a time?"
  if no conflict, versions don't change, and the commit is allowed
  if conflict, one transaction will see a lock or a changed version #

what about VALIDATE in Figure 4?
  it is an optimization for objects that are just read by a transaction
  VALIDATE = one-sided RDMA read to fetch the object's version # and lock flag
  if the lock is set, or the version # changed since the read, the TC aborts
  does not set the lock, thus faster than LOCK+COMMIT

VALIDATE example:
  x and y initially zero
  T1: if x == 0: y = 1
  T2: if y == 0: x = 1
  (this is a classic test example for consistency)
  T1,T2 yields y=1,x=0
  T2,T1 yields x=1,y=0
  aborts could leave x=0,y=0
  but serializability forbids x=1,y=1
  suppose they are simultaneous:
    T1: Rx  Ly  Vx  Cy
    T2: Ry  Lx  Vy  Cx
  the LOCKs will both succeed
  the VALIDATEs will both fail, since the lock bits are both set
  so both will abort -- which is OK
  how about:
    T1: Rx  Ly  Vx  Cy
    T2:   Ry      Lx  Vy  Cx
  then T1 commits, T2 still aborts
    since T2's Vy sees T1's lock or a higher version #
  but we can't have *both* V's before the other transaction's L's
  so VALIDATE seems correct in this example
    and fast: faster than LOCK, and no COMMIT required
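To pin down these per-object checks, here is a minimal Go sketch of what the
primary does on LOCK and COMMIT-PRIMARY, and of the check the TC applies to a
VALIDATE read.  This is not FaRM's code: the Object type and function names
are invented, and the real system keeps the version # and lock bit packed
into the object header in RDMA-accessible memory.

  // Sketch of the per-object concurrency-control checks described above.
  package farmsketch

  // Object stands in for an object header plus value in a region.
  type Object struct {
      Version uint64 // incremented on every committed write
      Locked  bool   // lock flag set by a successful LOCK
      Value   []byte
  }

  // handleLock: reply "yes" (true) only if the object is unlocked and
  // still at the version the transaction read; otherwise reply "no".
  // Note that it never waits for a lock to be released.
  func handleLock(o *Object, readVersion uint64) bool {
      if o.Locked || o.Version != readVersion {
          return false
      }
      o.Locked = true
      return true
  }

  // validateOK is the TC's check on the version # and lock flag fetched
  // by a one-sided VALIDATE read of a read-only object.
  func validateOK(version uint64, locked bool, readVersion uint64) bool {
      return !locked && version == readVersion
  }

  // applyCommitPrimary: install the new value, bump the version, unlock.
  func applyCommitPrimary(o *Object, newValue []byte) {
      o.Value = newValue
      o.Version++
      o.Locked = false
  }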
"might have committed": TC might have replied "yes" to client primary might have revealed update to a subsequent read after TC sees "yes" from all LOCKs and VALIDATEs, TC appends COMMIT-BACKUP to each backup's log after all ack, appends COMMIT-PRIMARY to each primary's log after one ack, reports "commit" to application note TC replicates to backups; primaries don't replicate COMMIT-BACKUP contains written value, enough to update backup's state why TC sends COMMIT-PRIMARY only after acks for *all* COMMIT-BACKUPs? a primary may execute as soon as it sees COMMIT-PRIMARY and show the update to other transactions so by that point each object's new value must in f+1 logs (per region) so f can fail without losing the new value if there's even one backup that doesn't have the COMMIT-BACKUP that object's writes are in only f logs all f could fail along with TC then we'd have exposed commit but maybe permanently lost one write! why TC waits for an ack from a COMMIT-PRIMARY? so that there is a complete f+1 region that's aware of the commit before then, only f backups per region knew (from COMMIT-BACKUPs) but we're assuming up to f per region to fail the basic line of reasoning for why recovery is possible: if TC could have reported "commit", or a primary could have exposed value, then all f+1 in each region have LOCK or COMMIT-BACKUP in log, so f can fail from any/every region without losing writes. if recovery sees one or more COMMIT-*, and a COMMIT-* or LOCK from each region, it commits; otherwise aborts. i.e. evidence TC decided commit, plus each object's writes. (Section 5.3, Step 7) FaRM is very impressive; does it fall short of perfection? * works best if few conflicts, due to OCC. * data must fit in total RAM. * the data model is low-level; would need e.g. SQL library. * details driven by specific NIC features; what if NIC had test-and-set? summary distributed transactions have been viewed as too slow for serious use maybe FaRM demonstrates that needn't be true