Q: Is Spark currently in use in any major applications?

A: Yes: https://databricks.com/customers

Q: How common is it for PhD students to create something on the scale of Spark?

A: Unusual! It is quite an accomplishment: Matei Zaharia won the ACM Doctoral Dissertation Award for this work.

Q: Should we view Spark as being similar to MapReduce?

A: There are similarities, but Spark can express computations that are difficult to express in a high-performance way in MapReduce, for example iterative algorithms. You can think of Spark as MapReduce and more: the authors argue that Spark is expressive enough to subsume a range of prior computation frameworks, including MapReduce (see Section 7.1, and the word-count sketch at the end of this FAQ).

There are systems that are better than Spark at incorporating new data as it streams in, instead of doing batch processing, for example Naiad (https://www.microsoft.com/en-us/research/project/naiad/). But the Spark developers have also been working on that problem (see Spark Streaming). There are also systems that allow more fine-grained sharing between machines (e.g., DSM systems), and systems such as Piccolo (http://piccolo.news.cs.nyu.edu/), which targets applications similar to Spark's.

Q: Why are RDDs called immutable if they allow for transformations?

A: A transformation does not modify the RDD it is applied to; it produces a new RDD, which may not even be materialized explicitly, because transformations are computed lazily. In the PageRank example, each loop iteration produces new contribs and ranks RDDs, as shown in the corresponding lineage graph (see the PageRank sketch at the end of this FAQ).

Q: Do distributed systems designers worry about energy efficiency?

A: Energy is a big concern in CS in general! But most of the focus on energy efficiency in cluster computing goes into the design of data centers, and of the computers and cooling inside them. Chip designers pay lots of attention to energy; for example, your processor dynamically changes its clock rate to avoid getting too hot. There is less focus on energy in the design of distributed systems, mostly, I think, because that is not where the big wins are. But there is some work; for example, see FAWN (http://www.cs.cmu.edu/~fawnproj/).

Q: How do applications figure out the location of an RDD?

A: The application names RDDs with variable names in Scala. Each RDD carries location information in its metadata (see Table 3), and the scheduler uses that information to colocate computations with the data. Different partitions of the same RDD may be computed and stored on different nodes (see the preferredLocations sketch at the end of this FAQ).

Q: How does Spark achieve fault tolerance?

A: When persisting an RDD, the programmer can ask for it to be replicated on a few machines. Spark doesn't need a complicated replication protocol like Raft, however, because RDDs are immutable and can always be recomputed from the lineage graph (see the persist sketch at the end of this FAQ).
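The sketches below are illustrative, not from the paper except where noted. They assume the Spark Scala API, with spark naming the SparkContext (as in the paper's examples) and "hdfs://..." standing in for a real input path.

First, the Section 7.1 claim that Spark subsumes MapReduce: the classic word-count job, in which flatMap/map play MapReduce's map phase and reduceByKey plays the shuffle-and-reduce phase.

    // Word count: a MapReduce job written as Spark transformations.
    val counts = spark.textFile("hdfs://...")   // placeholder input path
      .flatMap(line => line.split(" "))         // "map" phase: emit words
      .map(word => (word, 1))                   //   ... as (word, 1) pairs
      .reduceByKey(_ + _)                       // "reduce" phase: sum per word
    counts.saveAsTextFile("hdfs://...")         // an action forces evaluation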
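Second, a sketch of the paper's PageRank loop (adapted from Section 3.2.2); parseLink is a hypothetical parser, and ITERATIONS, a, and N are the iteration count, damping constant, and page count from the paper. The point for the immutability question: each iteration rebinds contribs and ranks to brand-new RDDs, and no RDD is ever mutated.

    // PageRank (adapted from Section 3.2.2 of the paper).
    val links = spark.textFile("hdfs://...")    // placeholder input path
      .map(parseLink)                           // hypothetical parser -> (url, outlink)
      .groupByKey()                             // (url, all outlinks)
      .persist()                                // reused on every iteration
    var ranks = links.mapValues(_ => 1.0)       // initial rank for every url

    for (i <- 1 to ITERATIONS) {
      // Each page divides its current rank among its outgoing links.
      val contribs = links.join(ranks).flatMap {
        case (url, (dests, rank)) => dests.map(dest => (dest, rank / dests.size))
      }
      // Rebind ranks to a *new* RDD; the previous ranks RDD is unchanged,
      // which is why the lineage graph grows by one layer per iteration.
      ranks = contribs.reduceByKey(_ + _).mapValues(sum => a / N + (1 - a) * sum)
    }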
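Third, on RDD locations: today's Spark exposes the placement metadata of Table 3 through the RDD's public preferredLocations method, which the scheduler consults when assigning tasks. This loop (illustrative only) prints where each partition of an HDFS-backed RDD would prefer to run.

    // Print the preferred (data-local) nodes for each partition.
    val rdd = spark.textFile("hdfs://...")      // placeholder input path
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index}: ${rdd.preferredLocations(p).mkString(", ")}")
    }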
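Finally, replicated persistence: in Spark's Scala API the replication request is expressed as a StorageLevel; for example, MEMORY_ONLY_2 keeps each partition in memory on two machines. If both copies are lost anyway, Spark falls back to recomputing the lost partitions from lineage.

    import org.apache.spark.storage.StorageLevel

    // Keep each partition of ranks in memory on two machines.
    ranks.persist(StorageLevel.MEMORY_ONLY_2)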