Bayou FAQ

Q: A lot of Bayou's design is driven by the desire to support
disconnected operation. Is that still important today?

A: Disconnected and weakly connected operation continue to be
important, though in different guises. Dropbox, Amazon's Dynamo,
Cassandra, git, and smartphone syncing are all real-world systems with
high-level goals and properties similar to Bayou's: they allow
immediate reads and writes of a local replica even when disconnected,
they provide eventual consistency, and they have to cope with write
conflicts.

Q: Doesn't widely-available wireless Internet mean everyone is
connected all the time?
 
A: Wireless connectivity doesn't seem to have reduced the need for
disconnected operation. Perhaps the reason is that anything short of
100% connectivity means that software has to be designed to handle
periods of disconnection. For example, my laptop has WiFi but not
cellular data; there are times when I want to be able to use my laptop
yet I either don't want to pay for WiFi, or there isn't a nearby
access point. For git, I often do not want to see others' changes
right away even if my laptop is connected. I want to be able to use my
smartphone's contacts list and calendar regardless of whether it's on
the Internet.

Q: Bayou supports direct synchronization of one device to another,
e.g. over Bluetooth or infrared, without going through the Internet or
a server. Is that important?

A: Perhaps not; I think syncing devices via an intermediate server on
the Internet is probably good enough.

Q: Does anyone use Bayou today? If not, why are we reading this paper?

A: No one uses Bayou today; it was a research prototype intended to
explore a new architecture. Bayou uses multiple ideas that are worth
knowing: eventual consistency, conflict resolution, logging operations
rather than data, use of timestamps to help agreement on order,
version vectors, and causal consistency via Lamport logical clocks.
 
Q: Has the idea of applications defining procedures for conflict
resolution been used in other distributed systems? Or has this concept
disappeared because most devices can be constantly connected?
 
A: Conflict resolution is used in many applications that involve
synchronization. For example, git will attempt to merge different
users' changes to different lines in the same file (though it doesn't
always do a great job).
 
Q: Do companies like Dropbox use protocols similar to Bayou?
 
A: I doubt Dropbox uses anything like Bayou. I suspect Dropbox moves
file content around, rather than having a log of writes. On the other
hand, Dropbox is not as flexible as Bayou at application-specific
resolution of conflicting updates to the same file.
 
Q: What does it mean for data to be weakly consistent?
 
A: It means that clients may see that different replicas are not
identical, though the system tries to keep them as identical as it
can.

Q: Is eventual consistency the best you can do if you want to support
disconnected operation?

A: Yes, eventual consistency (or slight improvements like causal
consistency) is the best you can do. If we want to support reads and
writes to disconnected replicas of the data (as Bayou does), we have
to tolerate users seeing stale data, and users causing conflicting
writes that are only evident after synchronization. It's nice that
it's even possible to get eventual consistency in such a system!
 
Q: It seems like writing dependency checks and merge procedures for a
variety of operations could be a tough interface for programmers to
handle. Is there anything I'm missing there?
 
A: It's true that programmer work is required. But in return Bayou
provides a very slick solution to the difficult problem of resolving
conflicting updates to the same data.

Q: Is the primary replica the single point of failure of the system?

A: Sort of. If the primary is down or unreachable, everything works
fine, except that writes won't be declared stable. For example, the
calendar application would basically work, though users would have to
live in fear that new calendar entries with low timestamps could
appear and change the displayed schedule.
 
Q: How do dependency checks detect Write-Write conflicts? The paper
says "Such conflicts can be detected by having the dependency check
query the current values of any data items being updated and ensure
that they have not changed from the values they had at the time the
Write was submitted", but I don't quite understand in this case what
the expected result of the dependency check is.
 
A: Suppose the application wants to modify calendar entry "10:00", but
only if there's no entry already. When the application runs, there's
no entry. So the application will produce this Bayou write operation:

  dependency_check = { check that the value for "10:00" is nil }
  update = { set the value for "10:00" to "staff meeting" }
  mergeproc = { ... }

If no other write modifies 10:00, then the dependency_check will succeed
on all servers, and they will all set the value to "staff meeting".

If some other write, by a different client at about the same time, sets
the 10:00 entry to "grades meeting", then that's a write/write conflict:
two different writes want to set the same DB entry to different values.
The dependency check will detect the conflict when synchronization
causes some servers to see both writes. One write will be first in the
log (because it has a lower timestamp); its dependency check will
succeed, and it will update the 10:00 entry. The write that's second in
the log will fail the dependency check, because the DB value for 10:00
is no longer nil.

That is, checking that the DB entry has the same value that it had when
the application originally ran does detect the situation in which a
conflicting write was ordered first in the log.
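
Here is a minimal Python sketch of the scenario above. The names (Write,
reserve, replay) and the fallback-slot mergeproc are assumptions made for
illustration; they are not the paper's API. The sketch only shows that
replaying the same log in timestamp order gives the same result on every
server.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

DB = Dict[str, str]


@dataclass
class Write:
    timestamp: int                          # assigned by the server that accepted the write
    server_id: str                          # breaks ties between equal timestamps
    dependency_check: Callable[[DB], bool]  # True if the update is safe to apply
    update: Callable[[DB], None]
    mergeproc: Callable[[DB], None]         # runs instead of update when the check fails


def reserve(timestamp: int, server_id: str, slot: str, title: str, fallback: str) -> Write:
    """Build a write that books `slot`; the fallback slot is a hypothetical mergeproc policy."""
    def check(db: DB) -> bool:
        return db.get(slot) is None

    def update(db: DB) -> None:
        db[slot] = title

    def merge(db: DB) -> None:
        if db.get(fallback) is None:
            db[fallback] = title

    return Write(timestamp, server_id, check, update, merge)


def replay(log: List[Write]) -> DB:
    """Execute every write in (timestamp, server_id) order against an empty DB."""
    db: DB = {}
    for w in sorted(log, key=lambda w: (w.timestamp, w.server_id)):
        if w.dependency_check(db):
            w.update(db)
        else:
            w.mergeproc(db)
    return db


staff = reserve(5, "A", "10:00", "staff meeting", "11:00")
grades = reserve(7, "B", "10:00", "grades meeting", "11:00")
print(replay([grades, staff]))
```

The write with the lower timestamp gets 10:00; the other fails its
dependency check and its mergeproc books 11:00 instead. The order in which
a server happened to hear the two writes does not matter.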
 
Q: When are dependency checks called?
 
A: Each server executes new log entries that it hears during
synchronization. For each new log entry, the server calls the entry's
dependency check; if the check succeeds, the server applies the entry's
update to its DB.
 
Q: If two clients make conflicting calendar reservations on partitioned servers,
do the dependency checks get called when those two servers communicate?
 
A: If the two servers synchronize, then their logs will contain each
other's reservations, and at that point both will execute the checks.
 
Q: It looks like the logic in the dependency check would take place when
you're first inserting a write operation, but you wouldn't find any
conflicts from partitioned servers.
 
A: Whenever a server synchronizes, it rolls back its DB to the earliest point at
which synchronization modified its log, and then re-executes all log entries
after that point (including the dependency checks).
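
One way to picture the roll-back point is the sketch below. It assumes log
entries are (timestamp, server_id, description) tuples ordered by timestamp
and then server ID; that layout and the function name merge_logs are
illustrative, not taken from the paper.

```python
from typing import List, Tuple

Entry = Tuple[int, str, str]  # (timestamp, server_id, description); hypothetical layout


def merge_logs(local: List[Entry], received: List[Entry]) -> Tuple[List[Entry], int]:
    """Return the merged, ordered log and the index of the first position that changed.

    Assumes `local` is already sorted and free of duplicates.
    """
    merged = sorted(set(local) | set(received))
    rollback_point = len(local)  # nothing to redo unless a difference is found below
    for i, entry in enumerate(merged):
        if i >= len(local) or local[i] != entry:
            rollback_point = i
            break
    return merged, rollback_point


local = [(1, "A", "w1"), (4, "A", "w2"), (9, "A", "w3")]
received = [(3, "B", "x1"), (12, "B", "x2")]
merged, point = merge_logs(local, received)
print(point, merged[point:])
```

Here the received entry with timestamp 3 lands at index 1, so the server
would undo everything from index 1 onward and re-execute merged[1:],
running each entry's dependency check again.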
 
Q: Are there any other systems in common use that employ this type of
application-sensitive conflict detection and resolution?
 
A: There are synchronization systems that have application-specific conflict
resolution. For example, when someone syncs their iPhone with their Mac, the
calendars are merged in a way that understands the structure of
calendars. Similarly, Dropbox has an API that allows applications to intervene
when Dropbox detects conflicting updates to the same file. However, I'm not
aware of any system other than Bayou that syncs a log of writes (other systems
typically sync the actual file content).
 
Q: What are anti-entropy sessions?
 
A: This refers to synchronization between a pair of devices, during which
the devices exchange log entries to ensure they have identical logs (and
thus identical DB content).
 
Q: What is an epidemic algorithm?
 
A: A communication scheme in which pairs of devices exchange data with
each other, including data they have heard from other devices. The
"epidemic" refers to the fact that, eventually, new data will spread
to all devices via these pairwise exchanges.
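
A toy Python illustration of that spreading, assuming each device's log is
just a set of write names (a simplification; real Bayou logs are ordered).
The function name anti_entropy is illustrative.

```python
from typing import Dict, Set


def anti_entropy(a: Set[str], b: Set[str]) -> None:
    """Pairwise exchange: afterwards both devices hold everything either one knew."""
    a.update(b)
    b.update(a)


devices: Dict[str, Set[str]] = {"laptop": {"w1"}, "phone": set(), "desktop": set()}

# The laptop and the desktop never talk directly; the phone carries w1 between them.
anti_entropy(devices["laptop"], devices["phone"])
anti_entropy(devices["phone"], devices["desktop"])
print(devices["desktop"])
```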
 
Q: Why are Write exchange sessions called anti-entropy sessions?
 
A: Perhaps because they reduce disorder. One definition of "entropy" is a
measure of disorder in a system.
 
Q: In order to know if writes are stabilized, does a server have to
contact all other servers?
 
A: Bayou has a special primary that commits (stabilizes) writes by assigning
them CSNs. You only have to contact the primary to know what's been committed.
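
A rough Python sketch of the idea, assuming (as in the paper) that committed
writes are ordered by CSN and come before all tentative writes, which are
ordered by timestamp. The class and function names are made up for this
example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class LogEntry:
    name: str
    timestamp: int
    server_id: str
    csn: Optional[int] = None  # None while the write is still tentative


class Primary:
    """Hypothetical primary: hands out CSNs in the order it commits writes."""

    def __init__(self) -> None:
        self.next_csn = 0

    def commit(self, entry: LogEntry) -> None:
        entry.csn = self.next_csn
        self.next_csn += 1


def log_order(entry: LogEntry) -> Tuple[int, int, str]:
    """Sort key: committed entries by CSN first, then tentative ones by timestamp."""
    if entry.csn is not None:
        return (0, entry.csn, "")
    return (1, entry.timestamp, entry.server_id)


a = LogEntry("a", timestamp=10, server_id="A")
b = LogEntry("b", timestamp=5, server_id="B")
primary = Primary()
primary.commit(a)  # the primary happened to see a first

ordered = [e.name for e in sorted([b, a], key=log_order)]
stable = [e.name for e in [a, b] if e.csn is not None]
print(ordered, stable)
```

Write a stays ahead of b in every log even though b has the lower
timestamp, because a has a CSN and b does not yet.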
 
Q: How much time could it take for a Write to reach all servers?
 
A: At worst, a server might never see updates, because it is broken.
At best, a server may synchronize with other servers frequently, and
thus may see updates quickly. There's not much concrete the paper can
say about this, because it depends on whether users turn off their
laptops, whether they spend a lot of time without network
connectivity, &c.
 
Q: In what case is automatic resolution not possible? Does it only depend
on the application, or is it the case that for any application, it's
possible for automatic resolution to fail?
 
A: It usually depends on the application. In the calendar example, if
I supply only two possible time slots e.g. 10:00 and 11:00, and those
slots are already taken, Bayou can't resolve the conflict. But suppose
the data is "the number of people who can attend lunch on Friday", and
the updates are increments to the number from people who can attend.
Those updates can always succeed -- it's always possible to add 1 to
the count.
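
The contrast can be sketched in a few lines of Python. The functions book
and increment are hypothetical stand-ins for application merge procedures,
not anything defined by Bayou.

```python
from typing import Dict, List, Optional


def book(calendar: Dict[str, str], title: str, slots: List[str]) -> Optional[str]:
    """Take the first free slot in `slots`; return None if every alternative is taken."""
    for slot in slots:
        if slot not in calendar:
            calendar[slot] = title
            return slot
    return None  # automatic resolution failed; a human has to step in


def increment(counts: Dict[str, int], key: str) -> int:
    """Adding 1 never conflicts, whatever other increments were applied first."""
    counts[key] = counts.get(key, 0) + 1
    return counts[key]


calendar = {"10:00": "staff meeting", "11:00": "grades meeting"}
outcome = book(calendar, "budget review", ["10:00", "11:00"])

counts: Dict[str, int] = {}
for _ in range(3):
    increment(counts, "friday lunch")
print(outcome, counts)
```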

Q: What are examples of good (quick convergence) and not-so-good anti-entropy
policies?
 
A: An example of a good situation is if all servers synchronize
frequently with a single server (or more generally if there's a path
between every two servers made up of frequently synchronizing pairs).
A bad situation is if some servers don't synchronize frequently, or if
there are two groups of servers with frequent synchronization within
each group but rare synchronization between the groups.
 
Q: I don't understand why "tentative deletion may result in a tuple that appears
in the committed view but not in the full view." (very beginning of page 8)
 
A: The committed view only reflects writes that have been committed.
So if there's a tentative operation that deletes record X, record X
will be in the committed view, but not in the full view. The full view
reflects all operations (including the delete), but the committed view
reflects only committed operations (thus not the delete).
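
A small Python sketch of the two views, assuming a log of
(kind, key, committed) tuples; that representation and the name build_view
are invented for the example.

```python
from typing import List, Set, Tuple

Op = Tuple[str, str, bool]  # (kind, key, committed); kind is "insert" or "delete"


def build_view(log: List[Op], include_tentative: bool) -> Set[str]:
    """Replay the log, optionally skipping operations that are not yet committed."""
    view: Set[str] = set()
    for kind, key, committed in log:
        if not committed and not include_tentative:
            continue
        if kind == "insert":
            view.add(key)
        elif kind == "delete":
            view.discard(key)
    return view


log = [("insert", "X", True), ("delete", "X", False)]  # the delete is still tentative
committed_view = build_view(log, include_tentative=False)
full_view = build_view(log, include_tentative=True)
print(committed_view, full_view)
```

Record X is present in committed_view and absent from full_view, which is
the situation the paper describes.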
 
Q: Bayou introduces a lot of new ideas, but it's not clear which ideas
are most important for performance.
 
A: I suspect Bayou is relatively slow. The authors' goal was not performance, but
rather new functionality: a new kind of system that could support shared mutable
data despite intermittent network connectivity.
 
Q: What kind of information does the Undo Log contain? (e.g. does it
contain a snapshot of changed files from a Write, or the reverse
operation?) Or is this more of an implementation detail?
 
A: I suspect that the undo log contains an entry for every write, with the value
of the DB record that the write modified *before* the modification. That allows
Bayou to roll back the DB in reverse log order, replacing each modified DB
record with the previous version.
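
Under that guess, an undo log could look roughly like the Python sketch
below. It assumes each write touches a single record; the names apply_write
and roll_back are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

DB = Dict[str, str]
UndoEntry = Tuple[str, Optional[str]]  # (key, value before the write, or None if absent)


def apply_write(db: DB, undo_log: List[UndoEntry], key: str, value: str) -> None:
    """Remember the old value, then overwrite the record."""
    undo_log.append((key, db.get(key)))
    db[key] = value


def roll_back(db: DB, undo_log: List[UndoEntry], keep: int) -> None:
    """Undo writes in reverse log order until only the first `keep` remain applied."""
    while len(undo_log) > keep:
        key, previous = undo_log.pop()
        if previous is None:
            db.pop(key, None)  # the record did not exist before this write
        else:
            db[key] = previous


db: DB = {}
undo_log: List[UndoEntry] = []
apply_write(db, undo_log, "10:00", "staff meeting")
apply_write(db, undo_log, "10:00", "grades meeting")
apply_write(db, undo_log, "11:00", "lunch")

roll_back(db, undo_log, keep=1)
print(db)
```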
 
Q: How is a particular server designated as the primary?
 
A: I think a human chooses.
 
Q: What if communication fails in the middle of an anti-entropy session?
 
A: Bayou always sends log entries in order when synchronizing, so it's OK
for the receiver to add the entries it received to its log, even though it
didn't hear subsequent entries from the sender.
 
Q: This paper doesn't talk about caching. Would we apply the same rules on the
memory model to caching too?
 
A: Each Bayou server (i.e. device like a laptop or iPhone) has a complete copy
of all the data. So there's no need for an explicit cache.

Q: What are the session guarantees mentioned by the paper?

A: Bayou allows an application to switch servers, so that a client
device can talk to a Bayou server over the network rather than having
to run a full Bayou server. But if an application switches servers
while it is running, the new server might not have seen all the writes
that the application sent to the previous servers. Session guarantees
are a technique to deal with this. The application keeps a session
version vector summarizing all the writes it has sent to any server.
When it connects to a new server, the application first checks that
the new server is at least as up to date as the session version
vector. If not, the application finds a new server that is up to date.
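
The check can be sketched in Python as follows, assuming a version vector
is a map from server ID to the highest write timestamp seen from that
server. The function names are illustrative, not from the paper.

```python
from typing import Dict, Optional

VersionVector = Dict[str, int]  # server id -> highest write timestamp seen from it


def record_write(session_vv: VersionVector, server_id: str, timestamp: int) -> None:
    """Note in the session vector that this session produced a write via server_id."""
    session_vv[server_id] = max(session_vv.get(server_id, 0), timestamp)


def dominates(server_vv: VersionVector, session_vv: VersionVector) -> bool:
    """True if the server has seen every write the session vector summarizes."""
    return all(server_vv.get(sid, 0) >= ts for sid, ts in session_vv.items())


def pick_server(servers: Dict[str, VersionVector], session_vv: VersionVector) -> Optional[str]:
    """Return the first sufficiently up-to-date server, or None if there is none yet."""
    for name, vv in servers.items():
        if dominates(vv, session_vv):
            return name
    return None


session: VersionVector = {}
record_write(session, "A", 7)  # the application wrote through server A at time 7

servers = {"B": {"A": 3}, "C": {"A": 9, "B": 2}}
chosen = pick_server(servers, session)
print(chosen)
```

Server B has only seen A's writes up to timestamp 3, so it is rejected;
server C has seen up to 9 and is safe for this session to use.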