6.824 2018 Lecture 15: Frangipani

Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee
SOSP 1997

why are we reading this paper?
  performance via caching
  cache coherence
  decentralized design

what's the overall design?
  a network file system
  works transparently with existing apps (text editors &c)
  much like Athena's AFS
  users, workstations + Frangipani, network, Petal
  Petal: block storage service
    replicated
    striped + sharded for performance

What's in Petal?
  directories, i-nodes, file content blocks, free bitmaps
  just like an ordinary hard disk file system

Frangipani: decentralized file service
  cache for performance

what's the intended use?
  environment: single lab with collaborating engineers
    == the authors' research lab
    programming, text processing, e-mail, &c
    workstations in offices
  most file access is to the user's own files
  need to potentially share any file among any workstations
    user/user collaboration
    one user logging into multiple workstations
  so: the common case is exclusive access; want that to be fast
  but files sometimes need to be shared; want that to be correct
  this was a common scenario when the paper was written

why is Frangipani's design good for the intended use?
  it caches aggressively in each workstation, for speed
  the cache is write-back
    allows updates to cached files/directories without network traffic
  all operations are entirely local to the workstation -- fast
    including e.g. creating files, creating directories, rename, &c
    updates proceed without any RPCs if everything is already cached
    so the file system code must reside in the workstation, not a server
    "decentralized"
  the cache also helps scalability (many workstations)
    servers were a serious bottleneck in previous systems

what's in the Frangipani workstation cache?
  what if WS1 wants to create and write /u/rtm/grades?
    read /u/rtm information from Petal into WS1's cache
    add an entry for "grades" just in the cache
    don't immediately write back to Petal!
      in case WS1 wants to do more modifications

challenges
  WS2 runs "ls /u/rtm" or "cat /u/rtm/grades"
    will WS2 see WS1's write?
    write-back cache, so WS1's writes aren't in Petal
    caches make stale reads a serious threat
    "coherence"
  WS1 and WS2 concurrently try to create tmp/a and tmp/b
    will they overwrite each other's changes?
    there's no central file server to sort this out!
    "atomicity"
  WS1 crashes while renaming, but other workstations are still operating
    how to ensure no-one sees the mess? how to clean up?
    "crash recovery"

"cache coherence" solves the "read sees write" problem
  the goal is linearizability AND caching
  there are lots of "coherence protocols"
    a common pattern: file servers, distributed shared memory, multi-core

Frangipani's coherence protocol (simplified):
  lock server (LS), with one lock per file/directory
    owner(lock) = WS, or nil
  workstation (WS) Frangipani cache:
    cached files and directories: present, or not present
    cached locks: locked-busy, locked-idle, unlocked
  workstation rules:
    acquire, then read from Petal
    write to Petal, then release
    don't cache unless you hold the lock
  coherence protocol messages:
    request (WS -> LS)
    grant (LS -> WS)
    revoke (LS -> WS)
    release (WS -> LS)
  the locks are named by files/directories (really i-numbers), though the
    lock server doesn't actually understand anything about file systems or Petal
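
aside: a minimal Go sketch of the lock-server side of this protocol
  (my illustration, not code from the paper; LockServer, Request, Release,
  and the revoke callback are invented names, and shared read locks, lock
  tables, and leases are all omitted)

    package lockserver

    import "sync"

    // Hypothetical exclusive lock server: one lock per i-number.
    type LockServer struct {
        mu     sync.Mutex
        cond   *sync.Cond
        owner  map[uint64]string            // i-number -> owning workstation ("" = free)
        revoke func(ws string, inum uint64) // sends a revoke RPC to the current owner
    }

    func NewLockServer(revoke func(string, uint64)) *LockServer {
        ls := &LockServer{owner: map[uint64]string{}, revoke: revoke}
        ls.cond = sync.NewCond(&ls.mu)
        return ls
    }

    // Request blocks until the lock for inum is granted to workstation ws.
    func (ls *LockServer) Request(ws string, inum uint64) {
        ls.mu.Lock()
        defer ls.mu.Unlock()
        for ls.owner[inum] != "" {
            go ls.revoke(ls.owner[inum], inum) // ask the owner to write back and release
            ls.cond.Wait()                     // woken by Release()
        }
        ls.owner[inum] = ws // grant
    }

    // Release is sent by the owner once it has forced its log and the
    // lock's dirty blocks to Petal.
    func (ls *LockServer) Release(ws string, inum uint64) {
        ls.mu.Lock()
        defer ls.mu.Unlock()
        if ls.owner[inum] == ws {
            ls.owner[inum] = ""
            ls.cond.Broadcast()
        }
    }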
example: WS1 changes directory /u/rtm, then WS2 reads it

  WS1                               LS                        WS2
  read /u/rtm
    --request(/u/rtm)-->
                                    owner(/u/rtm)=WS1
    <--grant(/u/rtm)---
  (read + cache /u/rtm data from Petal)
  (create /u/rtm/grades locally)
  (when done, the cached lock is in the locked-idle state)
                                                              read /u/rtm
                                    <--request(/u/rtm)--
    <--revoke(/u/rtm)--
  (write modified /u/rtm to Petal)
    --release(/u/rtm)-->
                                    owner(/u/rtm)=WS2
                                    --grant(/u/rtm)-->
                                                              (read /u/rtm from Petal)

the point: locks and rules force reads to see the last write
  locks ensure that "last write" is well-defined

coherence optimizations
  the "locked-idle" state is already an optimization
  Frangipani has shared read locks, as well as exclusive write locks
  you could imagine WS-to-WS communication, rather than via LS and Petal

next challenge: atomicity
  what if two workstations try to create the same file at the same time?
  are partially complete multi-write operations visible?
    e.g. file create initializes the i-node, adds a directory entry
    e.g. rename (both names visible? neither?)
  Frangipani has transactions:
    a WS acquires locks on all file system data that it will modify
    performs the modifications with all locks held
    only releases when finished
  thus no other WS can see partially-completed operations
  and no other WS can race to perform updates (e.g. file creation)
  note Frangipani's locks are doing two different things:
    cache coherence
    atomic transactions

next challenge: crash recovery
  What if a Frangipani workstation dies while holding locks?
  other workstations will want to continue operating...
  can we just revoke the dead WS's locks?
    what if the dead WS had modified data in its cache?
    what if the dead WS had started to write back modified data to Petal?
      e.g. the WS wrote a new directory entry to Petal, but not the initialized i-node
      this is the troubling case
  Is it OK to just wait until the crashed workstation reboots?

Frangipani uses write-ahead logging for crash recovery
  so if a crashed workstation has done some of the Petal writes for an
    operation, but not all, the writes can be completed from the log
  very traditional -- but:
  1) Frangipani has a separate log for each workstation
     rather than the traditional log per shard of the data
     this avoids a logging bottleneck and eases decentralization
     but scatters updates to a given file over many logs
  2) Frangipani's logs are in shared Petal storage
     WS2 may read WS1's log to recover from WS1 crashing
  separate logs are an interesting and unusual arrangement

What's in the log? (one possible layout is sketched in Go below)
  a log entry: (this is a bit of guess-work, the paper isn't explicit)
    log sequence number
    array of updates: block #, new version #, offset, new bytes
  it contains only meta-data updates, not file content updates
  example -- creating file d/f produces a log entry with a two-entry update array:
    add an "f" entry to d's content block, with the new i-number
    initialize the i-node for f

initially the log entry is in the WS's local memory (not yet in Petal)
When a WS gets a lock revocation for a modified directory from the LS:
  1) force its entire log to Petal, then
  2) send the cached updated blocks to Petal, then
  3) release the locks to the LS
Why must the WS write its log to Petal before updating i-nodes, directories, &c in Petal?
Why delay writing the log until the LS revokes locks?

What happens when WS1 crashes while holding locks?
  Not much, until WS2 requests a lock that WS1 holds
  LS sends revoke to WS1, gets no response
  LS times out, tells WS2 to recover WS1 from its log in Petal
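
aside: a Go sketch of a possible log-entry layout and the revocation
  write-back ordering (guess-work, matching the note above that the paper
  isn't explicit; Update, LogEntry, Workstation, and the helper fields are
  invented for illustration)

    package frangipani

    // One logged metadata operation: a sequence number plus an array of
    // block updates. File *content* is never logged.
    type Update struct {
        Block   uint64 // Petal block # of the metadata block
        Version uint64 // new version # for that block (old version + 1)
        Offset  int    // where in the block the new bytes go
        Bytes   []byte // the new bytes
    }

    type LogEntry struct {
        LSN     uint64 // log sequence number
        Updates []Update
    }

    // A workstation reduced to the three actions the revocation path
    // needs; the function fields stand in for real cache/Petal/RPC code.
    type Workstation struct {
        id          string
        forceLog    func()                       // append the in-memory log to this WS's log in Petal
        writeDirty  func(inum uint64)            // write dirty cached blocks covered by the lock
        releaseLock func(ws string, inum uint64) // release message to the lock server
    }

    // handleRevoke: what a WS does when the LS revokes a lock it holds in
    // the locked-idle state. The ordering is the whole point.
    func (ws *Workstation) handleRevoke(inum uint64) {
        ws.forceLog()               // 1) the log reaches Petal before any block it describes
        ws.writeDirty(inum)         // 2) then the modified i-node/directory blocks themselves
        ws.releaseLock(ws.id, inum) // 3) only now may another WS read this data from Petal
    }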
What does WS2 do to recover from WS1's log?
  read WS1's log from Petal
  perform the Petal writes described by the logged operations
  tell LS it is done, so LS can release WS1's locks
Note it's crucial that each WS's log is in Petal, so that
  it can be read by any WS for recovery.

What if WS1 crashes before it even writes recent operations to the log?
  WS1's recent operations may be totally lost.
  But the file system will be internally consistent.

Why is it safe to replay just one log, despite interleaved
  operations on the same files by other workstations?
  Example:
    WS1: delete(d/f), then crash
    WS2: create(d/f)
    WS3 is recovering WS1's log -- but it doesn't look at WS2's log
    Will recovery re-play the delete? This is The Question
  No -- prevented by the "version number" mechanism
    (the replay check is sketched in Go at the end of this section)
    a version number in each meta-data block (i-node) in Petal
    the version number(s) in each logged op are the block's version plus one
    recovery replays an op only if the op's version > the block's version
      i.e. only if the block hasn't yet been updated by this op
  Does WS3 need to acquire the d or d/f lock?
    No: if the version number is the same as before the operation,
      WS1 can't have released the lock, so it's safe to update Petal

Why is it OK that the log doesn't hold file *content*?
  If a WS crashes before writing content to Petal, it will be lost.
  Frangipani recovery defends the file system's own data structures.
  Applications can use fsync() to do their own recoverable content writes.
  It would be too expensive for Frangipani to log content writes.
  Most disk file systems (e.g. Linux) are similar, so applications
    already know how to cope with loss of writes before a crash.

What if:
  WS1 holds a lock
  network partition
  WS2 decides WS1 is dead, recovers, releases WS1's locks
  but WS1 is alive and subsequently writes data covered by the lock
Locks have leases!
  the lock owner can't use a lock past its lease period
  LS doesn't start recovery until after the lease expires

Is Paxos (== Raft) hidden somewhere here?
  Yes -- choice of lock server, choice of Petal primary/backup
  ensures a single lock server, despite partition
  ensures a single primary for each Petal shard

Performance?
  hard to judge numbers from 1997
  do they hit hardware limits? disk b/w, net b/w
  do they scale well with more hardware?
  what scenarios might we care about?
    read/write lots of little files (e.g. reading my e-mail)
    read/write huge files

Small file performance -- Figure 5
  X axis is the number of active workstations
    each workstation runs a file-intensive benchmark
    workstations use different files and directories
  Y axis is the completion time for a single workstation
  flat implies good scaling == no significant shared bottleneck
    presumably each workstation is just using its own cache
    possibly Petal's many disks also yield parallel performance

Big file performance
  each disk: 6 MB/sec
  Petal stripes to get more than that
    7 Petal servers, 9 disks per Petal server
    336 MB/s raw disk b/w, but only 100 MB/s via Petal
  a single Frangipani workstation, Table 3
    write: 15 MB/s -- limited by the network link
    read: 10 MB/s -- limited by weak pre-fetch (?), could be 15
  lots of Frangipani workstations
    Figure 6 -- reads scale well with more machines
    Figure 7 -- writes hit the hardware limits of Petal (2x for replication)

For what workloads is Frangipani likely to have poor performance?
  files bigger than the cache?
  lots of read/write sharing?
    caching requires a reasonable working set size
    whole-file locking is a poor fit for e.g. a distributed DB
    coherence is too good for e.g. a web site back-end
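
aside: a Go sketch of the version-number check during log replay,
  continuing the guessed LogEntry/Update layout above (same hypothetical
  package; readVersion and writeBlock are hypothetical Petal helpers)

    // replayLog: what a recovering workstation might do with a crashed
    // workstation's log.
    func replayLog(log []LogEntry,
        readVersion func(block uint64) uint64,
        writeBlock func(u Update)) {
        for _, e := range log {
            for _, u := range e.Updates {
                // u.Version is the version the logged op would produce.
                // If the block in Petal already has version >= u.Version,
                // this op (or a later one) has already been applied, and
                // replaying it would undo newer updates -- so skip it.
                if readVersion(u.Block) < u.Version {
                    writeBlock(u)
                }
            }
        }
    }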
Petal details
  Petal provides Frangipani with fault-tolerant storage, so it's worth discussing
  block read/write interface
    compatible with existing file systems
  looks like a single huge disk, but many servers and many many disks
    big, high performance
    striped, 64-KB blocks
  virtual: 64-bit sparse address space, allocate on write
    address translation map
  primary/backup (one backup server)
    the primary sends each write to the backup
    uses Paxos to agree on the primary for each virtual address range
  what about recovery after a crash? (see the Go sketch at the end of these notes)
    suppose the pair is S1+S2
    S1 fails, S2 is now the sole server
    S1 restarts, but has missed lots of updates
    S2 remembers a list of every block it wrote!
      so S1 only has to read those blocks, not the entire disk
  logging: the virt->phys map and missed-write info

Limitations
  most useful for e.g. programmer workstations, not so much otherwise
  Frangipani enforces permissions, so workstations must be trusted
    so Athena couldn't run Frangipani on Athena workstations
  the Frangipani/Petal split is a little awkward
    both layers log
    Petal may accept updates from "down" Frangipani workstations
    more RPC messages than a simple file server
  a file system is not a great API for many applications, e.g. a web site

Ideas to remember
  client-side caching for performance
  cache coherence protocols
  decentralized complex service on a simple shared storage layer
  per-client log for decentralized recovery
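
aside: a Go sketch of Petal's missed-write bookkeeping (an assumed
  illustration, not Petal's actual code; Replica, Write, and MissedBlocks
  are invented names)

    package petal

    import "sync"

    // Replica is one member of a primary/backup pair. While its partner
    // is down it records which virtual blocks it wrote, so the partner
    // can catch up by copying only those blocks rather than the whole disk.
    type Replica struct {
        mu          sync.Mutex
        blocks      map[uint64][]byte   // virtual block # -> data (stand-in for real disks)
        partnerDown bool
        missed      map[uint64]struct{} // blocks written while the partner was down
    }

    func (r *Replica) Write(block uint64, data []byte) {
        r.mu.Lock()
        defer r.mu.Unlock()
        r.blocks[block] = data
        if r.partnerDown {
            r.missed[block] = struct{}{} // remember for the partner's recovery
        }
        // (a live primary would also forward the write to its backup)
    }

    // MissedBlocks is what the recovering partner asks for: the blocks it
    // must re-read, instead of scanning the entire virtual disk.
    func (r *Replica) MissedBlocks() []uint64 {
        r.mu.Lock()
        defer r.mu.Unlock()
        out := make([]uint64, 0, len(r.missed))
        for b := range r.missed {
            out = append(out, b)
        }
        return out
    }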