FAQ for Frangipani, Thekkath, Mann, Lee, SOSP 1997

Q: Why are we reading this paper?

A: Primarily as an illustration of cache coherence.

But there are other interesting aspects. The idea of each client
having its own log, stored in a public place so that anyone can
recover from it, is clever. Further, the logs are intertwined in an
unusual way: the updates to a given object may be spread over many
logs. This makes replaying a single log tricky (hence Frangipani's
version numbers). Building a system out of simple shared storage
(Petal) and smart but decentralized participants is interesting,
particularly recovery from the crash of one participant.

Q: How does Frangipani differ from GFS?

A: A big architectural difference is that GFS has most of the file
system logic in the servers, while Frangipani distributes the logic
over the workstations that run Frangipani. That is, Frangipani doesn't
really have a notion of file server in the way that GFS does, only
file clients. Frangipani's design makes sense when most activity is
client workstations reading and writing a single user's files, which
can occur entirely from the workstation Frangipani cache; that's a
situation in which Frangipani delivers good performance. Frangipani
has a lot of mechanism to ensure that workstation caches stay
coherent, both so that a write on one workstation is immediately
visible to a read on another workstation, and so that complex
operations (like creating a file) are atomic even if other
workstations are trying to look at the file or directory involved.
This last situation is tricky for Frangipani because there's no
designated file server that executes all operations on a given file or
directory.

In contrast, GFS doesn't have caches at all, since its focus is
sequential read and write of giant files that are too big to fit in
any cache. It gets high performance for reads of giant files by
striping each file over many GFS servers. Because GFS has no caches,
GFS does not have a cache coherence protocol.

Frangipani appears as a real file system that you can use with any
existing workstation program. GFS doesn't present as a file system in
that sense; applications have to be written explicitly to use GFS via
library calls.

Q: Why do the Petal servers in the Frangipani system have a block
interface? Why not have file servers (like AFS), that know about
things like directories and files?

One reason is that the authors developed Petal first. Petal already
solved many fault tolerance and scaling problems, so using it
simplified some aspects of the Frangipani design. And this arrangement
moves work from centralized servers to client workstations, which
helps it scale well as more workstations are added.

However, the Petal/Frangipani split makes enforcing invariants on
file-system-level structures harder, since no one entity is in charge.
Frangipani builds its own transaction system (using the lock service
and Frangipani's logs) in order to be able to make complex atomic
updates to the file system stored in Petal.

Q: Can a workstation running Frangipani break security?

A: Yes. Since the file system logic is in the client workstations, the
design places trust in the clients. A user could modify the local
Frangipani software and read/write other users' data in Petal. This
makes Frangipani unattractive if users are not trusted. It might still
make sense in a small organization, or if Frangipani ran on separate
dedicated servers (not on workstations) and talked to user
workstations with a protocol like NFS.

Q: What is Digital Equipment?

A: It's the company that the authors worked at. Digital sold computers
and systems software. Unix was originally developed on Digital
computers (though not at Digital).

Q: What does the comment "File creation takes longer..." in Section
9.2 mean?

A: Because of logging, all changes have to be written to Petal twice:
once to the log, and once to the file system. An operation must be
written to the log first (thus "write-ahead log"). Then the on-disk
file system structures can be written. Only after that can the
operation be deleted from the log. That is, a portion of the log can
only be freed after all the on-disk file system updates from the
operations in that portion have been written to Petal.

So when Frangipani's log fills, Frangipani has to stop processing new
operations, send modified blocks from its cache to Petal for the
operations in the portion of the log it wants to re-use, and only then
free that portion and resume processing new operations.

You might wonder why increasing the log size improved performance.
After all, Frangipani has to perform the updates to the file system in
Petal no matter how long the log is, and those updates seem to be the
limiting performance factor. The paper doesn't say why increasing the
log size helps. Here's a guess. It's probably the case that the
benchmark involves updating the same directory many times, to add new
files to that directory. That directory's blocks will have to be
written to Petal every time part of the log is freed (assuming every
creation modifies the one directory). So letting the log grow longer
reduces the total number of times that directory has to be written to
Petal during the benchmark, which decreases benchmark running time.

Q: Frangipani is over 20 years old now; what's the state-of-the-art in
distributed file systems?

A: While people today do use distributed file systems for workstations
within an organization, such file systems have not recently been a
very active area of research and development. Lots of people use old
protocols such as SMB, NFS, and AFS. Some more recent systems are
xtreemfs, Ceph, and Lustre. A huge amount of network storage is sold
by companies like NetAPP and EMC, but I don't know to what extent
people use the storage at the disk level (via e.g. iSCSI) or as file
servers (e.g. talking to a NetAPP server with NFS).

Q: The paper says Frangipani only does crash recovery for its own file
system meta-data (i-nodes, directories, free bitmaps), but not for
users' file content. What does that mean, and why is it OK?

A: If a user on a workstation writes some data to a Frangipani file
system, and then the workstation immediately crashes, the recently
written data may be lost. That is, if the user logs into another
workstation and looks at the file, the recently written data may be
missing.

Ordinary Unix file systems (e.g. Linux on a laptop) have the same
property: file content written just before a crash may be lost.

Programs that care about crash recovery (e.g. text editors and
databases) can ask Unix to be more careful, at some expense in
performance. In particular, an application can call fsync(), which
tells Unix to force the data to disk immediately so that it will
survive a crash. The paper mentions fsync() in Section 2.1.

The rationale here is that the file system carefully defends its own
internal invariants (on its own meta-data), since otherwise the file
system might not be usable at all after a crash. But the file system
leaves maintaining of invariants in file content up to applications,
since only the application knows which data must be carefully (and
expensively) written to disk right away.

Q: What's the difference between the log stored in the Frangipani
workstation and the log stored on Petal?

The Petal paper hardly mentions Petal's log at all. My guess is that
Petal logs changes to the mapping from logical block number to current
disk location for that block, and logs changes to the "busy" bits that
indicate which block updates the other copy of the block may be missing.
That is, Petal logs information about low-level block operations.

Frangipani logs information about file system operations, which often
involve updating multiple pieces of file system state in Petal. For
example, a Frangipani log entry for deleting a file may say that the
delete operation modified some block and i-node free bitmap bits, and
that the operation erased a particular directory entry.

Both of the logs are stored on Petal's disks. From Petal's point of
view, Frangipani's logs are just data stored in Petal blocks; Petal does
not know anything special about Frangipani's logs.

Q: How does Petal take efficient snapshots of the large virtual disk that
it represents?

A: Petal maintains a mapping from virtual to physical block numbers. The
mapping is actually indexed by a pair: the virtual block number and an
epoch number. There is also a notion of the current epoch number. When
Petal performs a write to a virtual block, it looks at the epoch number
of the current mapping; if the epoch number is less than the current
epoch number, Petal creates a new mapping (and allocates a new physical
block) with the current epoch. A read for a virtual block uses the
mapping with the highest epoch.

Creating a snapshot then merely requires incrementing the current epoch
number. Reading from a snapshot requires the epoch number of the
snapshot to be specified; then each read uses the mapping for the
requested virtual block number that has the highest epoch number <= the
snapshot epoch.

Have a look at Section 2.2 of the Petal paper:

http://www.scs.stanford.edu/nyu/02fa/sched/petal.pdf

Q: The paper says that Frangipani doesn't immediately send new log
entries to Petal. What happens if a workstation crashes after
a system call completes, but before it send the corresponding
log entry to Petal?

A: Suppose an application running on a Frangipani workstation creates
a file. For a while, the information about the newly created file
exists only in the cache (in RAM) of that workstation. If the workstation
crashes before it writes the information to Petal, the new file is
completely lost. It won't be recovered.

So application that looks like

  create file x;
  print "I created file x";

might print "I created file x", but (if it then crashes) file x might
nevertheless not exist.

This may seem unfortunate, but it's fairly common even for local disk
file systems such as Linux. An application that needs to be sure its
data is really permanently saved can call fsync().

Q: What does it mean to stripe a file? Is this similar to sharding? 

A: It's similar to sharding. More specifically it means to distribute
the blocks of a single file over multiple servers (Petal server, for
this paper), so that (for example) the file's block 0 goes on server
0, block 1 goes on server 1, block 2 goes on server 0, block 3 goes on
server 1, &c. One reason to do this is to get high performance for
reads of a single big file, since different parts of the same file can
be read from different servers/disks. Another reason is to ensure that
load is balanced evenly over the servers.

Q: What is the "false sharing" problem mentioned in the paper?

A: The system reads and writes Petal at the granularity of a 512-byte
block. If it stored unrelated items X and Y on the same 512-byte
block, and one workstation needed to modify X while another needed to
modify Y, they would have to bounce the single copy of the block back
and forth between them. Because there's not a way to ask Petal to
write less than 512 bytes at a time. It's sharing because both
workstations are using the same block; it's false because they don't
fundamentally need to.
> Second, I don’t understand the section 4 bit about how you
> can enforce the stronger condition of never replaying a completed
> update. How is that stronger and why do we care?

Here's an example of the problem. Suppose workstation WS1 creates file
xxx, then deletes it. After that, a different workstation WS2 creates
file xxx. Their logs will look like this:

WS1:  create(xxx)  delete(xxx)
WS2:                           create(xxx)

At this point, correctness requires that xxx exist (since WS2's create
came after WS1's delete).

Now suppose WS1 crashes, and Frangipani recovery replays WS1's log.
Recovery will see the delete(xxx) operation in the log. We know that it
would be incorrect to actually do the delete, because we know that WS2's
create(xxx) came after WS1's delete(xxx). But how does Frangipani's
recovery software conclude that it should ignore the delete(xxx) when
replaying WS1's log?

Section 4 is saying that Frangipani's recovery software knows to ignore
the delete(xxx) because it "never replays a log record desccribing an
update that has already been completed." It achieves this property by
keeping a version number in every block of meta-data, and in every log
entry; if the log entry's version number is <= the meta-data block's
version number, Frangipani recovery knows that the meta-data block has
already been updated by the log entry, and thus that the log entry
should be ignored.