Q: Is Spark currently in use in any major applications?

A: Yes: https://databricks.com/customers

Q: How common is it for PhD students to create something on the scale of Spark?

A: Unusual! It is quite an accomplishment. Matei Zaharia won the ACM doctoral dissertation award for this work.

Q: Should we view Spark as being similar to MapReduce?

A: There are similarities, but Spark can express computations that are difficult to express in a high-performance way in MapReduce, for example iterative algorithms. You can think of Spark as MapReduce and more. The authors argue that Spark is powerful enough to express a range of prior computation frameworks, including MapReduce (see section 7.1).

There are systems that are better than Spark at incorporating new data as it streams in, instead of doing batch processing, for example Naiad (https://dl.acm.org/citation.cfm?id=2522738). Spark also has streaming support, although it implements it in terms of computations on "micro-batches" (see Spark Streaming, SOSP 2013). There are also systems that allow more fine-grained sharing between different machines (e.g., DSM systems), or a system such as Piccolo (http://piccolo.news.cs.nyu.edu/), which targets similar applications as Spark.

Q: Why are RDDs called immutable if they allow for transformations?

A: A transformation produces a new RDD rather than modifying an existing one, and the new RDD may not be materialized explicitly, because transformations are computed lazily. In the PageRank example, each loop iteration produces new contribs and ranks RDDs, as shown in the corresponding lineage graph.

Q: Do distributed systems designers worry about energy efficiency?

A: Energy is a big concern in CS in general! But most of the energy-efficiency work in cluster computing goes into the design of data centers and of the computers and cooling inside them. Chip designers pay lots of attention to energy; for example, your processor dynamically changes the clock rate to avoid getting too hot. There is less focus on energy in the design of distributed systems, mostly, I think, because that is not where the big wins are. But there is some work; for example, see http://www.cs.cmu.edu/~fawnproj/.

Q: How do applications figure out the location of an RDD?

A: The application names RDDs with variable names in Scala. Each RDD has location information associated with it in its metadata (see Table 3). The scheduler uses this information to colocate computations with the data. An RDD may be computed by multiple nodes, but each node computes different partitions of that RDD.

Q: How does Spark achieve fault tolerance?

A: When persisting an RDD, the programmer can specify that it must be replicated on a few machines (see the persist() sketch after the next question). Spark doesn't need complicated protocols like Raft, however, because RDDs are immutable and can always be recomputed using the lineage graph.

Q: Why is Spark developed using Scala? What's special about the language?

A: In part because Scala was new and hip when the project started. One good reason is that Scala provides the ability to serialize and ship user-defined code ("closures"), as discussed in §5.2. This is fairly straightforward in JVM-based languages (such as Java and Scala), but tricky to do in C, C++, or Go, partly because of shared memory (pointers, mutexes, etc.) and partly because the closure needs to capture all variables referred to inside it (which is difficult unless the language runtime can help with it).
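For concreteness, here is a minimal sketch of closure shipping (not from the paper; the input path and threshold are made up), written as if typed into the spark-shell, where `sc` is the predefined SparkContext:

    val threshold = 10                                // a local variable in the driver
    val lines = sc.textFile("hdfs://host/input.txt")  // placeholder path
    // The function passed to filter() captures `threshold`. Spark serializes the
    // closure, including the captured value, and ships it to the workers that
    // hold each partition of `lines`.
    val longLines = lines.filter(line => line.length > threshold)
    longLines.count()                                 // count() is an action: it forces evaluation

And here is the kind of call the fault-tolerance answer above refers to; this is a sketch assuming the `longLines` RDD from the snippet above:

    import org.apache.spark.storage.StorageLevel
    // Ask Spark to keep each partition of `longLines` in memory on two machines.
    // Even without replication, a lost partition can be recomputed from its lineage.
    longLines.persist(StorageLevel.MEMORY_ONLY_2)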
Q: Does anybody still use MapReduce rather than Spark, since Spark seems to be strictly superior? If so, why do people still use MR?

A: If the computation one needs fits the MapReduce paradigm well, there is no advantage to using Spark over MapReduce. For example, if the computation does a single scan of a large data set (map phase) followed by an aggregation (reduce phase), the computation will be dominated by I/O, and Spark's in-memory RDD caching will offer no benefit since no RDD is ever re-used. Spark does very well on iterative computations with a lot of internal reuse, but has no architectural edge over MapReduce or Hadoop for simple jobs that scan a large data set and aggregate (i.e., just map() and reduceByKey() transformations in Spark speak; a sketch of such a job appears at the end of this page). On the other hand, there is also no reason Spark would be slower for these computations, so you could really use either here.

Q: Is the RDD concept implemented in any systems other than Spark?

A: Spark and the specific RDD interface are pretty intimately tied to each other. However, two key ideas behind RDDs -- deterministic, lineage-based re-execution and the collections-oriented API -- are certainly widely used in many systems. For example, DryadLINQ, FlumeJava, and Cloud Dataflow offer similar collection-oriented APIs; and the Dryad and CIEL systems referenced by the paper also keep track of how pieces of data are computed, and re-execute that computation on failure, similar to lineage-based fault tolerance.

As a matter of fact, RDDs themselves in Spark are now somewhat deprecated: Spark has since moved towards "DataFrames" (https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes), which implement a more column-oriented representation while maintaining the good ideas from RDDs.
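Here is the scan-and-aggregate sketch promised above (not from the paper; paths and the choice of key are made up), again as spark-shell input:

    // One pass over the input (the "map phase"), then an aggregation by key
    // (the "reduce phase"). No RDD is ever re-used, so in-memory caching buys
    // nothing here, and MapReduce would do essentially the same work.
    val counts = sc.textFile("hdfs://host/records.csv")  // placeholder input path
      .map(line => (line.split(",")(0), 1))              // key = first column of each line
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://host/counts")          // placeholder output path

With the DataFrame API mentioned in the last answer, the same aggregation would be written roughly as df.groupBy("key").count().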