Q: Is Spark currently in use in any major applications?

A: Yes: https://databricks.com/customers

Q: How common is it for PhD students to create something on the scale of Spark?

A: Unusual! It is quite an accomplishment: Matei Zaharia won the ACM Doctoral Dissertation Award for this work.

Q: Should we view Spark as being similar to MapReduce?

A: There are similarities, but Spark can express computations that are difficult to express in a high-performance way in MapReduce, for example iterative algorithms. You can think of Spark as MapReduce and more: the authors argue that Spark is expressive enough to subsume a range of prior computation frameworks, including MapReduce (see Section 7.1, and the word-count sketch at the end of this FAQ).

There are systems that are better than Spark at incorporating new data as it streams in, instead of doing batch processing, for example Naiad (https://www.microsoft.com/en-us/research/project/naiad/). But the Spark developers have also been working on that problem (see Spark Streaming). There are also systems that allow more fine-grained sharing between machines (e.g., DSM systems), and systems such as Piccolo (http://piccolo.news.cs.nyu.edu/), which targets applications similar to Spark's.

Q: Why are RDDs called immutable if they allow for transformations?

A: A transformation does not modify the RDD it is applied to; it produces a new RDD, which may not even be materialized explicitly, because transformations are computed lazily. In the PageRank example, each loop iteration produces new contribs and ranks RDDs, as shown in the corresponding lineage graph (see the PageRank sketch at the end of this FAQ).

Q: Do distributed systems designers worry about energy efficiency?

A: Energy is a big concern in CS in general! But most of the focus on energy efficiency in cluster computing goes into the design of data centers, and of the computers and cooling inside them. Chip designers pay lots of attention to energy; for example, your processor dynamically changes its clock rate to avoid getting too hot. There is less focus on energy in the design of distributed systems, mostly, I think, because that is not where the big wins are. But there is some work; for example, see FAWN (http://www.cs.cmu.edu/~fawnproj/).

Q: How do applications figure out the location of an RDD?

A: The application names RDDs with variable names in Scala. Each RDD carries location information in its metadata (see Table 3), and the scheduler uses that information to colocate computations with the data. Different partitions of the same RDD may be computed and stored on different nodes (see the preferredLocations sketch at the end of this FAQ).

Q: How does Spark achieve fault tolerance?

A: When persisting an RDD, the programmer can ask for it to be replicated on a few machines. Spark doesn't need a complicated replication protocol like Raft, however, because RDDs are immutable and can always be recomputed from the lineage graph (see the persist sketch at the end of this FAQ).
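The sketches below are illustrative, not from the paper except where noted. They assume the Spark Scala API, with spark naming the SparkContext (as in the paper's examples) and "hdfs://..." standing in for a real input path.

First, the Section 7.1 claim that Spark subsumes MapReduce: the classic word-count job, in which flatMap/map play MapReduce's map phase and reduceByKey plays the shuffle-and-reduce phase.

    // Word count: a MapReduce job written as Spark transformations.
    val counts = spark.textFile("hdfs://...")   // placeholder input path
      .flatMap(line => line.split(" "))         // "map" phase: emit words
      .map(word => (word, 1))                   //   ... as (word, 1) pairs
      .reduceByKey(_ + _)                       // "reduce" phase: sum per word
    counts.saveAsTextFile("hdfs://...")         // an action forces evaluation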
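Second, a sketch of the paper's PageRank loop (adapted from Section 3.2.2); parseLink is a hypothetical parser, and ITERATIONS, a, and N are the iteration count, damping constant, and page count from the paper. The point for the immutability question: each iteration rebinds contribs and ranks to brand-new RDDs, and no RDD is ever mutated.

    // PageRank (adapted from Section 3.2.2 of the paper).
    val links = spark.textFile("hdfs://...")    // placeholder input path
      .map(parseLink)                           // hypothetical parser -> (url, outlink)
      .groupByKey()                             // (url, all outlinks)
      .persist()                                // reused on every iteration
    var ranks = links.mapValues(_ => 1.0)       // initial rank for every url

    for (i <- 1 to ITERATIONS) {
      // Each page divides its current rank among its outgoing links.
      val contribs = links.join(ranks).flatMap {
        case (url, (dests, rank)) => dests.map(dest => (dest, rank / dests.size))
      }
      // Rebind ranks to a *new* RDD; the previous ranks RDD is unchanged,
      // which is why the lineage graph grows by one layer per iteration.
      ranks = contribs.reduceByKey(_ + _).mapValues(sum => a / N + (1 - a) * sum)
    }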
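Third, on RDD locations: today's Spark exposes the placement metadata of Table 3 through the RDD's public preferredLocations method, which the scheduler consults when assigning tasks. This loop (illustrative only) prints where each partition of an HDFS-backed RDD would prefer to run.

    // Print the preferred (data-local) nodes for each partition.
    val rdd = spark.textFile("hdfs://...")      // placeholder input path
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
    }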
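Finally, replicated persistence: in Spark's Scala API the replication request is expressed as a StorageLevel; for example, MEMORY_ONLY_2 keeps each partition in memory on two machines. If both copies are lost anyway, Spark falls back to recomputing the lost partitions from lineage.

    import org.apache.spark.storage.StorageLevel

    // Keep each partition of ranks in memory on two machines.
    ranks.persist(StorageLevel.MEMORY_ONLY_2)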