Term
| What cluster managers does Spark support? |
|
Definition
| The Standalone cluster manager, Hadoop YARN, and Apache Mesos. |
|
|
Term
| What storage systems does Spark support? |
|
Definition
| Any storage system supported by the Hadoop APIs. |
|
|
Term
| Local mode is also known as |
|
Definition
|
|
Term
| How do you change the logging level of Spark? |
|
Definition
| via the conf/log4j.properties file |
|
|
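For example, one might copy conf/log4j.properties.template to conf/log4j.properties and edit the root category (the WARN level here is illustrative):

  log4j.rootCategory=WARN, console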
Term
| How do you execute a standalone Spark application? |
|
Definition
| With the bin/spark-submit script. |
|
|
Term
| How do you set the Spark application's name? |
|
Definition
Through:
- SparkConf::setAppName("...")
- spark-submit --name "..." |
|
|
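A minimal Scala sketch of the programmatic route ("My App" is a hypothetical name):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("My App")  // the name appears in the UI and logs
  val sc = new SparkContext(conf)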
Term
| Can you run a Spark application without setting any SparkConf properties? |
|
Definition
| Yes. Every property has a reasonable default value. |
|
|
Term
| What pseudocode creates an accumulator with initial value of 1 and adds 2 to it? |
|
Definition
sc.accumulator(1).add(2) |
|
|
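A runnable Scala version of that pseudocode, assuming an existing SparkContext named sc:

  val acc = sc.accumulator(1)                      // initial value 1
  sc.parallelize(1 to 10).foreach(x => acc += 2)   // += is equivalent to add(2), applied inside an action
  println(acc.value)                               // read the result back on the driver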
Term
| What method is used to obtain an accumulator's value? |
|
Definition
| Its value method (e.g., accumulator.value), called on the driver. |
|
|
Term
| What guarantees does Spark provide with respect to applying updates to accumulators? |
|
Definition
| Updates to accumulators in actions are only applied once. There is no guarantee for updates in transformations. |
|
|
Term
| If your application has high network traffic, what might you adjust? |
|
Definition
| The partitioning of your RDDs (e.g., with partitionBy()), so that key-oriented operations shuffle less data. |
|
|
Term
| When can partitioning provide performance benefits? |
|
Definition
| When a cached dataset would otherwise be shuffled by key multiple times, as in key-oriented applications such as joins. |
|
|
Term
| On what is partitioning available? |
|
Definition
| All RDDs of key/value pairs. |
|
|
Term
| How can partitioning provide performance benefits? |
|
Definition
| By ensuring a static dataset is only hashed once, assuming that dataset is cached and reused. |
|
|
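A Scala sketch of the hash-once pattern, assuming an existing SparkContext sc; the input file and partition count are illustrative:

  import org.apache.spark.HashPartitioner

  val pairs = sc.textFile("users.csv")
    .map(line => (line.split(",")(0), line))                  // key by user ID
  val partitioned = pairs.partitionBy(new HashPartitioner(100)).persist()
  // Later joins against `partitioned` reuse its partitioning instead of rehashing it.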
Term
| What is the number of partitions an upper bound for? |
|
Definition
| The number of tasks / degree of parallelism |
|
|
Term
| What is a good guideline for the number of partitions you should have for an RDD? |
|
Definition
| At least as large as the number of cores in your cluster. |
|
|
Term
| If an operation modifies a single RDD and that RDD is partitioned and cached, what data is transferred over the network? |
|
Definition
| Only the output of the operation. The input is operated upon locally. |
|
|
Term
| What partitioner will be selected for an operation on two RDDs that sets a partitioner? |
|
Definition
It depends on whether those RDDs have partitioners set:
1. If neither parent RDD has a partitioner, a hash partitioner is used with the level of parallelism defined by the operation
2. If only one parent RDD has a partitioner, that partitioner is used
3. If both parent RDDs have a partitioner, the partitioner of the first parent is used. |
|
|
Term
| In local mode, how many processes does a Spark application have? |
|
Definition
| One - the driver and a single executor run in the same process. |
|
|
Term
| In distributed mode, how many processes does a Spark application have? |
|
Definition
| One for the driver, and one for each executor. |
|
|
Term
| On encountering an action, what does Spark's scheduler do? |
|
Definition
| It creates a physical execution plan, working backward from the final RDD being computed. |
|
|
Term
| What is a stage? |
|
Definition
| A group of one or more transformations/actions that is executed as N tasks, where N is the number of partitions. |
|
|
Term
| What is pipelining? |
|
Definition
| When multiple transformations/actions are combined into a single stage |
|
|
Term
| In the simplest case, how many stages will a Spark application have? |
|
Definition
| One for each transformation and action |
|
|
Term
| When is pipelining performed? |
|
Definition
| When an RDD can be computed from its parents without any data movement |
|
|
Term
| When is data shuffling avoided? |
|
Definition
| When the necessary shuffle output is available in a persisted RDD or is still on disk from an earlier computation. |
|
|
Term
| In what order are Spark stages executed? |
|
Definition
| In the order defined by RDD lineage |
|
|
Term
| How does Spark handle the loss of a persisted RDD? |
|
Definition
| It determines what is necessary to calculate that RDD through the lineage graph, and then recalculates it. |
|
|
Term
| Why shouldn't we use collect() on large datasets? |
|
Definition
| Because the entire dataset will have to fit in the Driver program's memory, which may not be feasible. |
|
|
Term
| What can you do to mitigate slowdowns due to lost persisted RDDs? |
|
Definition
| Replicate persisted data to multiple nodes |
|
|
Term
| What persistence options are available? |
|
Definition
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP |
|
|
Term
| What happens if you try to persist more data than you have memory for? |
|
Definition
| Spark evicts partitions using an LRU policy. With a MEMORY_AND_DISK level, evicted partitions spill to disk; otherwise they are simply recomputed the next time they are needed. |
|
|
Term
| How do you remove data from the cache? |
|
Definition
| Call the .unpersist() method on the RDD |
|
|
Term
| What can hinder performance when using broadcast variables? |
|
Definition
| Serializing large broadcast values; the default Java serialization can be slow, so consider an alternative such as Kryo. |
|
|
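A Scala sketch, assuming an existing SparkContext sc and a small hypothetical lookup table:

  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))         // shipped to each node once
  val counts = sc.parallelize(Seq("a", "b", "c"))
    .map(k => lookup.value.getOrElse(k, 0))                  // read-only access on executors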
Term
| Where can external program scripts be loaded from? |
|
Definition
| Local file system, any Hadoop supported file system, HTTP, HTTPS, or FTP |
|
|
Term
| How can we set environment variables for external scripts? |
|
Definition
| Pass them as a map in the second argument of the pipe() command. |
|
|
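A Scala sketch, assuming an existing SparkContext sc and a hypothetical script ./myscript.sh:

  val env = Map("MY_VAR" -> "some-value")                    // exported into the script's environment
  val piped = sc.parallelize(Seq("hello", "world"))
    .pipe(Seq("./myscript.sh"), env)                         // the env map is pipe()'s second argument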
Term
| When are object files commonly used? |
|
Definition
| To save Spark job data to be used by other code/jobs |
|
|
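A Scala sketch, assuming an existing SparkContext sc; the output directory is illustrative:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
  pairs.saveAsObjectFile("out/pairs")                        // writes Java-serialized objects
  val restored = sc.objectFile[(String, Int)]("out/pairs")   // read back by a later job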
Term
| What is a danger of using object files? |
|
Definition
| It requires programmer effort to maintain backwards compatibility when changing serialized classes |
|
|
Term
| What is a caveat when serializing an RDD of Writable objects? |
|
Definition
| Many Writable objects are not serializable. You may need to use a map function to unwrap them before serializing. |
|
|
Term
| What is a caveat when caching an RDD of Writable objects? |
|
Definition
| Caching can fail because Hadoop's RecordReader reuses the same Writable instance for every record, so the cache can end up holding many references to one object. Apply a map() that copies the values before caching. |
|
|
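A Scala sketch of the copy-before-caching pattern, assuming an existing SparkContext sc and a hypothetical SequenceFile of Text/IntWritable pairs:

  import org.apache.hadoop.io.{IntWritable, Text}

  val data = sc.sequenceFile("in/counts", classOf[Text], classOf[IntWritable])
    .map { case (k, v) => (k.toString, v.get) }              // copy values out of the reused Writables
    .cache()                                                 // safe to cache the copies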
Term
| What type of Hadoop formats generally support compression? |
|
Definition
|
|
Term
| What happens if Spark reads data from a source with an unsplittable compression? |
|
Definition
| Spark reads all the data on a single node. |
|
|
Term
| How does textFile handle sources with splittable compression? |
|
Definition
| textFile ignores splittability altogether |
|
|
Term
| If you want to read a text file with a splittable compression, what should you do? |
|
Definition
| Use the Hadoop API commands directly and specify the codec |
|
|
Term
| How does Spark handle reading from local file systems? |
|
Definition
| All files must be at the same path on all cluster nodes |
|
|
Term
| What is a URI Spark would recognize for an S3 object? |
|
Definition
| s3n://bucket/path-within-bucket |
|
|
Term
| What is a URI Spark would recognize for an HDFS object? |
|
Definition
| hdfs://master:port/path |
|
|
Term
| What happens after a user submits an application with spark-submit? |
|
Definition
1. spark-submit launches the driver program and invokes its main() method
2. The driver program asks the cluster manager for resources to launch executors
3. The cluster manager launches executors
4. The driver process runs, sending tasks to executors based on the application's transformations/actions
5. Executors run tasks and save/transmit results
6. On exit, the executors are terminated and the cluster manager's resources are released |
|
|
Term
| When are executors terminated and cluster manager resources released by a Spark application? |
|
Definition
| When the driver's main() method exits or when SparkContext.stop() is called |
|
|
Term
In what mode does the following execute?
spark-submit my_script.py |
|
Definition
| Local mode, since no --master flag is given. |
|
|
Term
| How would you run myjob.jar locally with 2 cores? |
|
Definition
| spark-submit --master local[2] myjob.jar |
|
|
Term
| How would you run myjob.jar locally with the maximum number of cores? |
|
Definition
| spark-submit --master local[*] myjob.jar |
|
|
Term
| How would you run myjob.jar on a YARN cluster? |
|
Definition
spark-submit --master yarn myjob.jar
Additionally, set the HADOOP_CONF_DIR environment variable to the location of your Hadoop configuration directory |
|
|
Term
| How would you run myjob.jar on a Mesos cluster? |
|
Definition
| spark-submit --master mesos://host:port myjob.jar |
|
|
Term
| How would you run myjob.jar on a standalone cluster? |
|
Definition
| spark-submit --master spark://host:port myjob.jar |
|
|
Term
| Where does the driver program run when spark-submit is executed? |
|
Definition
By default, it will run on the machine where spark-submit is executed (client mode). To instead run it on one of the worker nodes, use:
--deploy-mode cluster |
|
|
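Putting the common flags together, an illustrative submission (the flag values and myjob.jar are hypothetical):

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --name "My App" \
    --executor-memory 512m \
    myjob.jar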
Term
| How do you set the name of your application via spark-submit? |
|
Definition
| spark-submit --name "..." |
|
|
Term
| How do you put JAR files on the classpath and transmit those JARs to the cluster nodes? |
|
Definition
| spark-submit --jars jar,jar,... |
|
|
Term
| How do you put non-jar files in the working directory of your application for each cluster node? |
|
Definition
| spark-submit --files file,file,... |
|
|
Term
| How do you put python files on the PYTHONPATH of the application? |
|
Definition
| spark-submit --py-files file.py,file.egg,file.zip,... |
|
|
Term
| How do you specify 512 megabytes of executor memory for your application? |
|
Definition
| spark-submit --executor-memory 512m |
|
|
Term
| How do you specify 5 gigabytes of driver memory for your application? |
|
Definition
| spark-submit --driver-memory 5g |
|
|
Term
| How can you provide arbitrary configuration properties via spark-submit? |
|
Definition
| spark-submit --conf prop=value --conf prop=value ... |
|
|
Term
| How can you provide a properties file via spark-submit? |
|
Definition
| spark-submit --properties-file ... |
|
|
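An illustrative properties file and its use (my.conf and the values are hypothetical):

  # my.conf
  spark.master           spark://host:7077
  spark.app.name         MyApp
  spark.executor.memory  512m

  spark-submit --properties-file my.conf myjob.jar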
Term
| If you have many library dependencies, what is an alternative to providing them via spark-submit? |
|
Definition
| Create a fat jar containing all dependencies, typically using a build tool. |
|
|
Term
| What shouldn't you include as a dependency in a fat jar? |
|
Definition
| Spark itself; mark it as a provided dependency, since it is already installed on the cluster. |
|
|
Term
| What primarily governs resource sharing between (inter) Spark applications? |
|
Definition
| The cluster manager. |
|
|
Term
| If a Spark application asks for 5 executors, how many is it guaranteed to get? |
|
Definition
| No guarantee. It may receive fewer, or more, depending on availability and contention in the cluster. |
|
|
Term
| What governs resource sharing within a long lived Spark application (intra)? |
|
Definition
| Spark's internal scheduler |
|
|
Term
| What does Spark's Fair Scheduler provide? |
|
Definition
| Applications can define pools (priority queues) for scheduling their jobs |
|
|
Term
| How do most cluster managers handle scheduling between jobs? |
|
Definition
| By defining priority queues and/or capacity limits for jobs |
|
|
Term
| When is the standalone cluster manager appropriate? |
|
Definition
| When you only want Spark to run on the cluster. |
|
|
Term
| What are the steps to stand up a standalone cluster? |
|
Definition
1. Put Spark at the same location on all cluster nodes
2. Enable password-less SSH access between the cluster nodes
3. Add the worker hostnames to the master's conf/slaves
4. Run sbin/start-all.sh on the master |
|
|
Term
| On what port does a standalone cluster run by default? |
|
Definition
| Port 7077 for the master (with a web UI on port 8080). |
|
|
Term
| How does Spark handle a request for more memory than an executor node has available? |
|
Definition
| It will not launch executors on that node. |
|
|
Term
| If you have a *standalone*, 20-node cluster with 4-core machines, how can you limit your job to running on eight machines? |
|
Definition
| spark-submit --total-executor-cores 8 |
|
|
Term
| For multiple applications how many executors will run on a single node in a *standalone* cluster? |
|
Definition
| By default no more than 1 per application |
|
|
Term
| If you have a *standalone* 20-node cluster with 4-core machines, how can you have 8 executors running on as few nodes as possible? |
|
Definition
| Set spark.deploy.spreadOut to false |
|
|
Term
| Can you have multiple masters with a *standalone* cluster? |
|
Definition
| Yes. Using ZooKeeper, standby masters can be configured to take over if the active master fails. |
|
|
Term
| By default, how many executors are used for a YARN application? |
|
Definition
| Two |
|
|
Term
| How can you change the number of executors an application will launch? |
|
Definition
| spark-submit --num-executors ... |
|
|
Term
| How do you set the number of cores each executor will use in a YARN cluster? |
|
Definition
| spark-submit --executor-cores ... |
|
|
Term
| How can you submit a Spark application to a specific YARN cluster queue? |
|
Definition
| spark-submit --queue queuename |
|
|
Term
| How can you use Zookeeper to elect a master node in a Mesos cluster? |
|
Definition
| spark-submit --master mesos://zk://node1:port/mesos,node2:port/mesos,... |
|
|
Term
| What modes does Mesos offer for scheduling and which is the default? |
|
Definition
Fine-grained (the default)
Coarse-grained (set via spark.mesos.coarse=true) |
|
|
Term
| What is Mesos' fine-grained mode? |
|
Definition
| Dynamically scales the number of CPUs executors claim to share cluster resources among multiple jobs as they come and go. |
|
|
Term
| When wouldn't you want to use Mesos' fine-grained mode? |
|
Definition
| For applications with high latency sensitivity (e.g., Spark Streaming) |
|
|
Term
| What is Mesos' coarse-grained mode? |
|
Definition
| Spark allocates a fixed number of CPUs to each executor which are not released until the application ends. |
|
|
Term
| At what level is Mesos' scheduling mode set (per-cluster or per-job)? |
|
Definition
| Per-job, since spark.mesos.coarse is set in the application's own configuration. |
|
|
Term
| How many cores will Mesos use in the cluster by default? |
|
Definition
| All available cores in the cluster. |
|
|
Term
| How can you set a limit for the number of cores Mesos will use? |
|
Definition
| spark-submit --total-executor-cores ... |
|
|
Term
| How can you start a standalone cluster in EC2? |
|
Definition
| With the spark-ec2 script bundled with Spark (e.g., ./spark-ec2 launch clustername). |
|
|
Term
| What does the Spark script for launching a EC2 cluster also put on the nodes? |
|
Definition
Ephemeral HDFS
Persistent HDFS
Tachyon
Ganglia |
|
|
Term
| What config property sets the application name? |
|
Definition
| spark.app.name |
|
|
Term
| What config property sets the master? |
|
Definition
| spark.master |
|
|
Term
| What file does Spark read properties from by default? How is it overridden? |
|
Definition
SPARK_HOME/conf/spark-defaults.conf
spark-submit --properties-file |
|
|
Term
| What is the order of precedence for how Spark loads properties? |
|
Definition
1. Properties set via user code (highest)
2. Flags passed to spark-submit
3. Values in the properties file
4. Default properties (lowest) |
|
|
Term
| How would you provide Java options to executors using spark-submit? |
|
Definition
| spark-submit --conf "spark.executor.extraJavaOptions=..." |
|
|
Term
| How would you provide library paths to executors using spark-submit? |
|
Definition
| spark-submit --conf "spark.executor.extraLibraryPath=..." |
|
|
Term
| How would you specify the local storage directories for executors? |
|
Definition
Standalone/Mesos: the SPARK_LOCAL_DIRS environment variable, falling back to the spark.local.dir property (a comma-separated list of directories)
YARN: the LOCAL_DIRS environment variable, falling back to the spark.local.dir property (comma-separated) |
|
|
Term
| Where are shuffle outputs written? |
|
Definition
| To the executors' local storage directories (SPARK_LOCAL_DIRS / spark.local.dir). |
|
|
Term
| What is skew? |
|
Definition
| When a small number of tasks take a large amount of time |
|
|
Term
| Where are Spark's logs stored and accessed in Standalone and Mesos? |
|
Definition
Standalone: stored in work/ dir on workers. Displayed in master's web interface
Mesos: Stored in work/ dir on mesos slaves, accessed via mesos master UI |
|
|
Term
| How would you view application logs in YARN? |
|
Definition
| yarn logs -applicationId ... |
|
|
Term
| How can you easily provide a log4j.properties file to your Spark application? |
|
Definition
| spark-submit --files log4j.properties |
|
|
Term
| How do you specify to use the Kryo serializer? |
|
Definition
| Set the 'spark.serializer' property to 'org.apache.spark.serializer.KryoSerializer' |
|
|
Term
| What should you consider when using Kryo to serialize your custom classes? |
|
Definition
Register those classes with Kryo to save space via:
conf.registerKryoClasses(Array(classOf[...], ...)) |
|
|
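A Scala sketch combining both steps (MyClass and MyOtherClass are hypothetical user classes):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))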
Term
| What can help debug a NotSerializableException? |
|
Definition
| Set the java option "-Dsun.io.serialization.extendedDebugInfo=true" |
|
|
Term
| How is JVM memory distributed for executors by default? How is it changed? |
|
Definition
Divided between:
- Persisted RDD storage: default 60%, set with spark.storage.memoryFraction
- Shuffle output: 20% soft limit, set with spark.shuffle.memoryFraction
- User code: anything left over (default 20%) |
|
|
Term
| What is one reason you might cache serialized objects? |
|
Definition
| To reduce garbage collection time, which scales with the number of objects on the heap rather than their size. |
|
|