PNUTS FAQ

Q: Who uses PNUTS?
A: Yahoo is the only company using PNUTS, as far as I know. Other companies struggle with similar issues, but only some of them describe their solutions in detail. We will look at Amazon's Dynamo, a different design tailored for shopping carts, and at some work from Facebook. Google's Spanner is another approach to a similar set of problems.

Q: What happens when the master fails?
A: If the record's master region loses connectivity or the region is down, then *no* new master is designated. Storage units that have writes outstanding for the master must wait until the region is available again. If the YMB server at the master fails, there's a backup with a copy of the log ready to take over.

Q: What is the YMB?
A: I don't know of a good description of the YMB (Yahoo Message Broker), but the API is in the following spirit:

  publish(message)
  subscribe(handler)
  handler(message)

A YMB client publishes a message. Other YMB clients subscribe and provide handlers that are called when a message is published. If several processes are subscribed, then they all receive messages in the same order. YMB persists messages on disk, replicates them, and ensures that messages are delivered in the same order as a local YMB cluster received them for a given topic. YMB is reliable -- it keeps trying to send each message until the recipient acknowledges. Perhaps Yahoo's newer Pulsar is similar to YMB:
https://yahooeng.tumblr.com/post/150078336821/open-sourcing-pulsar-pub-sub-messaging-at-scale

Q: How does PNUTS use the YMB?
A: The record master will publish updates to its local YMB cluster. YMB will deliver each message to the YMB in each other region, and each of them will forward each message to a local PNUTS router, which will send it to the appropriate SU (storage unit). Because there is only a single writer in PNUTS (i.e., the record master), the updates to a record will all be totally ordered.
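The flow described so far can be sketched in Python. This is only a toy model under stated assumptions: ToyBroker, Replica, and RecordMaster are hypothetical names, not Yahoo's API; one in-memory log stands in for a YMB topic; routers, storage units, disk persistence, and acknowledgements are all left out. It shows just two things: a single writer assigns versions, and every subscriber sees the updates in the same log order even while it lags.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, object]
Handler = Callable[[Message], None]


class ToyBroker:
    """Hypothetical in-memory stand-in for one YMB topic (not Yahoo's API)."""

    def __init__(self) -> None:
        self.log: List[Message] = []      # messages in publish order
        self.handlers: List[Handler] = []
        self.cursors: List[int] = []      # next log index for each subscriber

    def subscribe(self, handler: Handler) -> None:
        self.handlers.append(handler)
        self.cursors.append(0)

    def publish(self, message: Message) -> None:
        # Only appends to the log; delivery happens later, asynchronously.
        self.log.append(message)

    def deliver(self, limit: Optional[int] = None) -> None:
        """Hand each subscriber its undelivered messages, always in log order."""
        for i, handler in enumerate(self.handlers):
            start = self.cursors[i]
            end = len(self.log) if limit is None else min(len(self.log), start + limit)
            for message in self.log[start:end]:
                handler(message)
            self.cursors[i] = end


class Replica:
    """Hypothetical per-region copy of the table; applies updates as delivered."""

    def __init__(self) -> None:
        self.records: Dict[str, Message] = {}

    def apply(self, message: Message) -> None:
        self.records[str(message["key"])] = message


class RecordMaster:
    """The single writer for its records: assigns versions, then publishes."""

    def __init__(self, broker: ToyBroker) -> None:
        self.broker = broker
        self.versions: Dict[str, int] = {}

    def write(self, key: str, value: str) -> int:
        version = self.versions.get(key, 0) + 1
        self.versions[key] = version
        self.broker.publish({"key": key, "value": value, "version": version})
        return version


broker = ToyBroker()
west, east = Replica(), Replica()
broker.subscribe(west.apply)
broker.subscribe(east.apply)

master = RecordMaster(broker)
master.write("alice", "acl: friends only")
master.write("alice", "acl: friends only; photos: spring break")

broker.deliver(limit=1)   # both replicas lag by one update
lagging_version = east.records["alice"]["version"]
broker.deliver()          # both replicas catch up
```

After the first deliver call a reader at either replica sees a stale but valid version of the record; after the second, both replicas hold the same latest version. No replica ever applies the second update before the first.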
Because YMB doesn't lose messages (it stores them on disk, replicated), the messages are always guaranteed to be delivered eventually.

Q: What are ordered and hashed tables?
A: Ordered tables keep keys sorted so that range queries are efficient (e.g., return all records between key1 and key2). Ordered tables are often implemented with trees. Hashed tables don't keep keys ordered, but instead assign each key to a slot in an array using a hash function on the key; they are fast when the application just needs the one record associated with a particular key. In practice most applications can be made happy with these two kinds of tables.

Q: For Example 1, would you have to use Read-critical() instead of Read-any() to ensure that you get the latest version of the photo permissions? Read-any() could return stale data, so it seems like you could get the old permissions when mom wants to view the photos.
A: There is an assumption that the photo list and ACL are in the same record. Read-any() should work fine because writes to the same record are ordered: a replica may be stale, but it never applies a later write to a record without the earlier ones. Thus, if Alice updates the ACL first and then the photo list, a reader who observes the new photo list will also observe the new ACL.

Q: Are the contributions mentioned in Section 1.2 really fundamentally new and radical?
A: Every research paper must claim contributions to convince the reviewers of the conference to which the paper was submitted that the paper is original. I think the right way to read this paper is as an experience paper on building a system where reads can be served from the local data center so that the user observes low delay. This means that the designers had to give up some consistency, because checking other data centers for the most recent write would be too expensive. Yet they don't want the lack of consistency to make programmers' lives complicated. The paper reports how they address these goals with mostly well-known techniques, but combined in a way that led to a novel system design.
The observation that the design worked well for some substantial real-world applications is itself a valuable contribution.

Q: What is a hosted system?
A: PNUTS operates as a shared infrastructure for many Yahoo! applications. Instead of having each Yahoo! application deal with the problems of several data centers, Yahoo! provides a shared infrastructure that many applications can use, so that application development is simpler. That raises the challenge of deciding what services the shared infrastructure should provide, and how, so that it works well for many applications. In addition, the Yahoo! team can add more machines dynamically to the shared infrastructure to deal with load, and that might benefit several applications.

Q: What is referential integrity?
A: Suppose there's a table with a record per MIT student, keyed by the student's MIT ID, and there's another table with the 6.824 grade for each student in 6.824, with the student's MIT ID. If there's a record in the 6.824 grade table with MIT ID 12345, but there is no entry for MIT ID 12345 in the student table, that's a violation of referential integrity. That is, a record in one table refers to a record that ought to exist in a second table, but doesn't. Many SQL databases enforce referential integrity, but PNUTS does not.

Q: What does the paper mean by partial versus total ordering?
A: By "partial", the paper means that writes to different records have no ordering relationship -- two writes to different records may be executed in either order at different regions. By "total", the paper means that two writes to the same record are executed by all regions in the same order.

Q: What are secondary indexes and materialized views as described in Section 6.1?
A: A secondary index is an index of a table on some column. Often you want quick access to a table's records by looking up in one of a number of different columns, e.g. I might want to retrieve personnel records at MIT by employee name, or MIT ID, or department.
For each, there might be an index, keyed by name or ID or department, yielding for each possible value the set of matching table records. Most databases support this kind of index. A materialized view is a pseudo-table that is maintained as a result of updates to ordinary tables. I might have an ordinary table with a record for each vote cast in an election, and a materialized view that maintains, for each candidate, the current number of votes. Materialized views are a feature of some SQL databases.

Q: What challenges does geographic distribution create in a distributed system?
A: The killer problem is that the communication time between replicas at different sites is long, due to the speed of light. It takes perhaps 40 milliseconds to send a message over the Internet from the east coast to the west coast. So if you ran a strict replication protocol like Raft with replicas on both coasts, each operation would need at least one round trip between the coasts, so it would take at least 80 milliseconds to get anything done. Many people consider strict schemes like Raft impractically slow when the replicas are geographically distributed. In contrast, an efficient Raft on a LAN with all the replicas in the same room can probably complete an operation in tens of microseconds, roughly 1000 times faster.

Q: What do other big web sites use for storage?
A: Lots of web sites use traditional SQL databases like MySQL. Facebook uses a combination of MySQL, memcached, and custom code to push data updates to different geographically-separated replicas. Google uses BigTable and Spanner. Amazon uses Dynamo. Lots of web sites use open source databases like Redis, MongoDB, and Cassandra. Of these, the ones that support geographical replication (and are thus attacking the same problems as PNUTS) are Facebook's setup, Spanner, Dynamo, and Cassandra.

Q: Why do companies publish papers about their proprietary software? Wouldn't it be better to keep this intellectual property a secret?
A: Publishing interesting high-quality research is a good way to attract top-quality engineers. The value of a steady stream of excellent new employees is usually far higher than any loss caused by publishing.