Shared Flashcard Set

Details

Spark - Deployments
Preparing for Spark Developer Certification
112
Computer Science
Professional
11/10/2015

Cards

Term
What cluster managers does Spark support?
Definition
Mesos, YARN, Standalone
Term
What storage systems does Spark support?
Definition
Any supported by the Hadoop APIs.
Term
Local mode is also known as
Definition
non-distributed mode
Term
How do you change the logging level of Spark?
Definition
via the conf/log4j.properties file
Term
How do you execute a standalone Spark application?
Definition
Via spark-submit
Term
How do you set the Spark application's name?
Definition
Through:

- SparkConf::setAppName("...")
- spark-submit --name "..."
Term
Can you run a Spark application without setting any SparkConf properties?
Definition
Yes
Term
What pseudocode creates an accumulator with initial value of 1 and adds 2 to it?
Definition
sc.accumulator(1)
.add(2)
Term
What method is used to obtain an accumulator's value?
Definition
.value
Term
What guarantees does Spark provide with respect to applying updates to accumulators?
Definition
Updates to accumulators in actions are only applied once. There is no guarantee for updates in transformations.
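
A toy model in plain Python (not Spark's API) may help show why updates made inside transformations carry no guarantee: if a task is re-executed, its update is applied again. The `ToyAccumulator` class and `run_task` helper are hypothetical names used only for illustration.

```python
class ToyAccumulator:
    """Hypothetical stand-in for an accumulator: add() plus a value property."""

    def __init__(self, initial):
        self._value = initial

    def add(self, amount):
        self._value += amount

    @property
    def value(self):
        return self._value


def run_task(records, acc):
    """Simulate one task of a transformation that counts blank records as a side effect."""
    for record in records:
        if record == "":
            acc.add(1)
    return [r for r in records if r != ""]


blank_lines = ToyAccumulator(0)
partition = ["a", "", "b", ""]

first_attempt = run_task(partition, blank_lines)
# Assume the result was lost (for example, evicted from the cache) and the task is re-run.
second_attempt = run_task(partition, blank_lines)
```

The partition holds two blank records, yet the counter ends up reflecting both attempts, which is the double counting the card warns about.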
Term
Executors exist on ____
Definition
Nodes
Term
Executors run ____
Definition
Tasks
Term
If your application has high network traffic, what might you adjust?
Definition
Partitioning
Term
When can partitioning provide performance benefits?
Definition
When a cached dataset would be shuffled by key (i.e. reused) multiple times in key-oriented applications - such as joins.
Term
On what is partitioning available?
Definition
Pair RDDs
Term
How can partitioning provide performance benefits?
Definition
By ensuring a static dataset is only hashed once, assuming that dataset is cached and reused.
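
As a rough illustration of hash partitioning (a plain-Python model, not Spark's implementation), the sketch below assigns each key to a partition with `hash(key) % num_partitions`. When two datasets are partitioned the same way, matching keys already sit at the same partition index, so a join can proceed one partition at a time. The function names are hypothetical, and integer keys are used because Python's `hash` of a small int is stable across runs.

```python
from collections import defaultdict


def hash_partition(pairs, num_partitions):
    """Group (key, value) pairs into partitions by hash(key) % num_partitions."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions


def local_join(left_parts, right_parts, num_partitions):
    """Join two co-partitioned datasets one partition at a time."""
    joined = []
    for index in range(num_partitions):
        right_by_key = defaultdict(list)
        for key, value in right_parts.get(index, []):
            right_by_key[key].append(value)
        for key, left_value in left_parts.get(index, []):
            for right_value in right_by_key[key]:
                joined.append((key, (left_value, right_value)))
    return joined


# The "static" dataset is hashed once and can be reused for every later join.
users = hash_partition([(1, "ann"), (2, "bob"), (3, "cy")], 2)
events = hash_partition([(1, "click"), (3, "view"), (3, "buy")], 2)
result = sorted(local_join(users, events, 2))
```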
Term
What is the number of partitions an upper bound for?
Definition
The number of tasks / degree of parallelism
Term
What is a good guideline for the number of partitions you should have for an RDD?
Definition
At least as large as the number of cores in your cluster.
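
The arithmetic behind this guideline can be sketched as follows, assuming each task occupies exactly one core (an assumption of this example, not something the card states). The `parallelism` helper is hypothetical.

```python
import math


def parallelism(num_partitions, total_cores):
    """Return (tasks running at once, number of task waves) for one stage."""
    concurrent = min(num_partitions, total_cores)
    waves = math.ceil(num_partitions / total_cores)
    return concurrent, waves


too_few = parallelism(4, 16)    # fewer partitions than cores: some cores sit idle
enough = parallelism(32, 16)    # every core is busy, work finishes in two waves
```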
Term
If an operation modifies a single RDD and that RDD is partitioned and cached, what data is transferred over the network?
Definition
Only the output of the operation. The input is operated upon locally.
Term
What partitioner will be selected for an operation with two RDDs that sets a partitioner?
Definition
It depends on whether those RDDs have partitioners set.

1. If not, a Hash Partitioner is used with the level of parallelism defined by the operation

2. If only one parent RDD has a partitioner, that partitioner is used

3. If both parent RDDs have a partitioner, the partitioner of the first parent is used.
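
The three rules above can be written as a small decision function. This is a sketch in plain Python with partitioners represented as strings; `choose_partitioner` is a hypothetical name, not a Spark API.

```python
def choose_partitioner(first_parent, second_parent, parallelism):
    """Return the partitioner the output would use under the three rules.

    Each parent argument is the name of that parent's partitioner,
    or None if the parent has no partitioner set.
    """
    if first_parent is None and second_parent is None:
        return f"HashPartitioner({parallelism})"
    if first_parent is not None:
        return first_parent  # covers "only the first is set" and "both are set"
    return second_parent
```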
Term
In local mode, how many processes does a Spark application have?
Definition
One - the driver and a single executor run in the same process.
Term
In distributed mode, how many processes does a Spark application have?
Definition
One for the driver, and one for each executor.
Term
On encountering an action, what does Spark's scheduler do?
Definition
Create a physical execution plan working backward from the final RDD being computed.
Term
What is a "stage"?
Definition
One or more transformations, possibly ending in an action, divided into N tasks, where N is the number of partitions.
Term
What is pipelining?
Definition
When multiple transformations (and possibly an action) are combined into a single stage
Term
In the simplest case, how many stages will a Spark application have?
Definition
One for each transformation and action
Term
When is pipelining performed?
Definition
When an RDD can be computed from its parents without any data movement
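
A simplified model of how pipelining shapes stages: operations are merged into the current stage until one requires data movement, which starts a new stage. This is a plain-Python sketch with hypothetical names. It treats a shuffle operation as the first step of the next stage, which is a simplification of how Spark actually splits the map and reduce sides of a shuffle.

```python
def split_into_stages(operations):
    """Group operations into stages, starting a new stage at every shuffle.

    `operations` is a list of (name, needs_shuffle) tuples in lineage order.
    """
    stages = []
    for name, needs_shuffle in operations:
        if needs_shuffle or not stages:
            stages.append([name])
        else:
            stages[-1].append(name)  # pipelined with the previous operation
    return stages


plan = split_into_stages([
    ("textFile", False),
    ("map", False),
    ("filter", False),
    ("reduceByKey", True),
    ("map", False),
    ("collect", False),
])
```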
Term
When is data shuffling avoided?
Definition
When the necessary shuffle output is in a persisted RDD, or is still written to disk.
Term
In what order are Spark stages executed?
Definition
In the order defined by RDD lineage
Term
How does Spark handle loss of a persisted RDD?
Definition
It determines what is necessary to calculate that RDD through the lineage graph, and then recalculates it.
Term
Why shouldn't we use collect() on large datasets?
Definition
Because the entire dataset will have to fit in the Driver program's memory, which may not be feasible.
Term
What can you do to mitigate slowdowns due to lost persisted RDDs?
Definition
Replicate persisted data to multiple nodes
Term
What persistence options are available?
Definition
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP
Term
What happens if you try to persist more data than you have memory for?
Definition
Spark will evict partitions based on a LRU policy. If a MEMORY_AND_DISK policy is used, the contents will spill over to disk, otherwise they will just be recomputed the next time they are needed.
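
The eviction behaviour described above can be modelled with an ordered dictionary. This is a toy sketch, not Spark's block manager: `ToyCache`, its capacity measured in partitions, and the `recompute` callback are all assumptions made for illustration.

```python
from collections import OrderedDict


class ToyCache:
    """Hypothetical partition cache with LRU eviction and optional disk spill."""

    def __init__(self, capacity, spill_to_disk):
        self.capacity = capacity
        self.spill_to_disk = spill_to_disk
        self.memory = OrderedDict()
        self.disk = {}

    def put(self, partition_id, data):
        self.memory[partition_id] = data
        self.memory.move_to_end(partition_id)
        while len(self.memory) > self.capacity:
            # Pop the least recently used partition.
            evicted_id, evicted = self.memory.popitem(last=False)
            if self.spill_to_disk:
                self.disk[evicted_id] = evicted

    def get(self, partition_id, recompute):
        if partition_id in self.memory:
            self.memory.move_to_end(partition_id)
            return self.memory[partition_id]
        if partition_id in self.disk:
            return self.disk[partition_id]
        return recompute(partition_id)  # not stored anywhere: rebuild it


memory_only = ToyCache(capacity=2, spill_to_disk=False)
memory_and_disk = ToyCache(capacity=2, spill_to_disk=True)
for pid in ("p0", "p1", "p2"):
    memory_only.put(pid, pid.upper())
    memory_and_disk.put(pid, pid.upper())
```

After the third `put`, partition `p0` is gone from memory in both caches. Only the disk-backed one can return it without calling `recompute`.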
Term
How do you remove data from the cache?
Definition
Call the .unpersist() method on the RDD
Term
What can hinder performance when using broadcast variables?
Definition
Serialization overhead
Term
Where can external program scripts be loaded from?
Definition
Local file system, any Hadoop supported file system, HTTP, HTTPS, or FTP
Term
How can we set environment variables for external scripts?
Definition
Pass them in as a map to the second argument of the pipe() command.
Term
When are object files commonly used?
Definition
To save Spark job data to be used by other code/jobs
Term
What is a danger of using object files?
Definition
It requires programmer effort to maintain backwards compatibility when changing serialized classes
Term
What is a caveat when serializing an RDD of Writable objects?
Definition
Many Writable objects are not serializable. You may need to use a map function to unwrap them before serializing.
Term
What is a caveat when caching an RDD of Writable objects?
Definition
Caching an RDD of Writables can fail because Hadoop's RecordReader reuses the same Writable object for each record. Use a map to copy the values out before caching.
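
The pitfall can be reproduced in plain Python, assuming the problem comes from a reader that hands back one mutable object and overwrites it for every record. `ToyWritable` and `read_records` are hypothetical stand-ins, not Hadoop classes.

```python
class ToyWritable:
    """Hypothetical mutable record holder, standing in for a Hadoop Writable."""

    def __init__(self):
        self.text = None


def read_records(lines):
    """Yield the same ToyWritable for every record, overwriting its contents."""
    shared = ToyWritable()
    for line in lines:
        shared.text = line
        yield shared


lines = ["alpha", "beta", "gamma"]

# Caching the objects themselves keeps three references to one object.
cached_objects = list(read_records(lines))
seen_after_caching = [w.text for w in cached_objects]

# Mapping each record to an immutable copy first preserves every value.
cached_copies = [w.text for w in read_records(lines)]
```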
Term
What type of Hadoop formats generally support compression?
Definition
Formats that are written out to a file system
Term
What happens if Spark reads data from a source with an unsplittable compression?
Definition
Spark reads all the data on a single node.
Term
How does textFile handle sources with splittable compression?
Definition
textFile ignores splittability altogether
Term
If you want to read a text file with a splittable compression, what should you do?
Definition
Use the Hadoop API commands directly and specify the codec
Term
How does Spark handle reading from local file systems?
Definition
All files must be at the same path on all cluster nodes
Term
What is a URI Spark would recognize for an S3 object?
Definition
s3n://bucket/key
Term
What is a URI Spark would recognize for an HDFS object?
Definition
hdfs://master:port/path
Term
What happens after a user submits an application with spark-submit?
Definition
1. spark-submit launches the driver program and invokes the main method
2. The driver program asks the cluster manager for resources to launch executors
3. The cluster manager launches executors
4. The driver process runs, sending tasks to executors based on transformations/actions
5. Executors run tasks and save/transmit results
6. On exit, the executors are terminated and the cluster manager resources are released
Term
When are executors terminated and cluster manager resources released by a Spark application?
Definition
When the driver's main() method exits or when SparkContext.stop() is called
Term
In what mode does the following execute?

spark-submit my_script.py
Definition
Local mode
Term
How would you run myjob.jar locally with 2 cores?
Definition
spark-submit --master local[2] myjob.jar
Term
How would you run myjob.jar locally with the maximum number of cores?
Definition
spark-submit --master local[*] myjob.jar
Term
How would you run myjob.jar on a yarn cluster?
Definition
spark-submit --master yarn myjob.jar

Additionally, set the HADOOP_CONF_DIR environment variable to the location of your Hadoop configuration directory
Term
How would you run myjob.jar on a mesos cluster?
Definition
spark-submit --master mesos://host:port myjob.jar
Term
How would you run myjob.jar on a standalone cluster?
Definition
spark-submit --master spark://host:port myjob.jar
Term
Where does the driver program run when spark-submit is executed?
Definition
By default, it will run on the machine where spark-submit is executed (client mode). To instead run it on one of the worker nodes, use:

--deploy-mode cluster
Term
How do you set the name of your application via spark-submit?
Definition
spark-submit --name "..."
Term
How do you put JAR files on the classpath and transmit those JARs to the cluster nodes?
Definition
spark-submit --jars jar,jar,...
Term
How do you put non-jar files in the working directory of your application for each cluster node?
Definition
spark-submit --files file,file,...
Term
How do you put python files on the PYTHONPATH of the application?
Definition
--py-files *.py,*.egg,*.zip
Term
How do you specify 512 megabytes of executor memory for your application?
Definition
spark-submit --executor-memory 512m
Term
How do you specify 5 gigabytes of driver memory for your application?
Definition
spark-submit --driver-memory 5g
Term
How can you provide arbitrary configuration properties via spark-submit?
Definition
spark-submit --conf prop=value --conf prop=value ...
Term
How can you provide a properties file via spark-submit?
Definition
spark-submit --properties-file
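
To see how the flags from the preceding cards combine on one command line, here is a small helper that assembles a spark-submit argument list. The `build_spark_submit` function is hypothetical. It only uses flags that appear in these cards and does not run anything itself.

```python
def build_spark_submit(app, master=None, name=None, deploy_mode=None,
                       executor_memory=None, driver_memory=None,
                       jars=(), files=(), py_files=(), conf=None):
    """Assemble a spark-submit argument list from the flags covered in these cards."""
    args = ["spark-submit"]
    single_valued = [
        ("--master", master),
        ("--deploy-mode", deploy_mode),
        ("--name", name),
        ("--executor-memory", executor_memory),
        ("--driver-memory", driver_memory),
    ]
    for flag, value in single_valued:
        if value is not None:
            args += [flag, value]
    # These flags each take one comma-separated list.
    comma_separated = [("--jars", jars), ("--files", files), ("--py-files", py_files)]
    for flag, values in comma_separated:
        if values:
            args += [flag, ",".join(values)]
    # Arbitrary properties are passed as repeated --conf prop=value pairs.
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


command = build_spark_submit(
    "myjob.jar",
    master="local[2]",
    name="demo",
    executor_memory="512m",
    conf={"spark.serializer": "org.apache.spark.serializer.KryoSerializer"},
)
```

On a machine with Spark installed, the resulting list could be handed to `subprocess.run(command)`.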
Term
If you have many library dependencies, what is an alternative to providing them via spark-submit?
Definition
Create a fat jar containing all dependencies, typically using a build tool.
Term
What shouldn't you include as a dependency in a fat jar?
Definition
Spark itself
Term
What primarily governs resource sharing between (inter) Spark applications?
Definition
The Cluster Manager
Term
If a Spark application asks for 5 executors, how many is it guaranteed to get?
Definition
No guarantee. It may receive fewer, or more, depending on availability and contention in the cluster.
Term
What governs resource sharing within a long lived Spark application (intra)?
Definition
Spark's internal scheduler
Term
What does Spark's Fair Scheduler provide?
Definition
Applications can define priority queues for tasks
Term
How do most cluster managers handle scheduling between jobs?
Definition
By defining priority queues and/or capacity limits for jobs
Term
When is the standalone cluster manager appropriate?
Definition
When you only want Spark to run on the cluster.
Term
What are the steps to stand up a standalone cluster?
Definition
1. Put Spark at the same location on all cluster nodes
2. Enable password-less SSH access between the cluster nodes
3. Add the worker hostnames to the master's conf/slaves
4. Run sbin/start-all.sh on the master
Term
On what port does a standalone cluster run by default?
Definition
7077
Term
How does Spark handle a request for more memory than an executor node has available?
Definition
It does not launch an executor on that node.
Term
If you have a *standalone*, 20-node cluster with 4-core machines, how can you limit your job to eight cores (by default spread across eight machines, one core each)?
Definition
spark-submit --total-executor-cores 8
Term
For multiple applications how many executors will run on a single node in a *standalone* cluster?
Definition
By default no more than 1 per application
Term
If you have a *standalone* 20-node cluster with 4-core machines, how can you have 8 executor cores packed onto as few nodes as possible?
Definition
Set spark.deploy.spreadOut to false
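
A toy model of the two placement strategies, assuming a request for 8 executor cores on the 20-node, 4-core cluster from the cards above. The `allocate_cores` function is hypothetical and only mimics the observable outcome. It is not the standalone master's actual scheduling code.

```python
def allocate_cores(total_cores, nodes, cores_per_node, spread_out):
    """Return {node_index: cores} for a request of `total_cores` executor cores."""
    allocation = {}
    remaining = total_cores
    if spread_out:
        # Hand out one core at a time, round-robin across nodes.
        node = 0
        while remaining > 0 and sum(allocation.values()) < nodes * cores_per_node:
            if allocation.get(node, 0) < cores_per_node:
                allocation[node] = allocation.get(node, 0) + 1
                remaining -= 1
            node = (node + 1) % nodes
    else:
        # Fill each node completely before moving to the next.
        for node in range(nodes):
            if remaining == 0:
                break
            take = min(cores_per_node, remaining)
            allocation[node] = take
            remaining -= take
    return allocation


spread = allocate_cores(8, nodes=20, cores_per_node=4, spread_out=True)
packed = allocate_cores(8, nodes=20, cores_per_node=4, spread_out=False)
```

In this model the spread-out request lands on eight nodes with one core each, while the packed request fills two nodes with four cores each.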
Term
Can you have multiple masters with a *standalone* cluster?
Definition
Yes, via Zookeeper
Term
By default, how many executors are used for a YARN application?
Definition
2
Term
How can you change the number of executors an application will launch?
Definition
spark-submit --num-executors
Term
How do you set the number of cores each executor will use in a YARN cluster?
Definition
spark-submit --executor-cores ...
Term
How can you submit a Spark application to a specific YARN cluster queue?
Definition
spark-submit --queue ...
Term
How can you use Zookeeper to elect a master node in a Mesos cluster?
Definition
spark-submit --master mesos://zk://node1:port/mesos,node2:port/mesos,...
Term
What modes does Mesos offer for scheduling and which is the default?
Definition
Fine-grained (default)
Coarse-grained (set via spark.mesos.coarse=true)
Term
What is Mesos' fine-grained mode?
Definition
Dynamically scales the number of CPUs executors claim to share cluster resources among multiple jobs as they come and go.
Term
When wouldn't you want to use Mesos' fine-grained mode?
Definition
For applications with high latency sensitivity (e.g., Spark streaming)
Term
What is Mesos' coarse-grained mode?
Definition
Spark allocates a fixed number of CPUs to each executor which are not released until the application ends.
Term
At what level is Mesos' scheduling mode set (per-cluster/per-job)?
Definition
per-job
Term
How many cores will Mesos use in the cluster by default?
Definition
All of them
Term
How can you set a limit for the number of cores Mesos will use?
Definition
spark-submit --total-executor-cores
Term
How can you start a standalone cluster in EC2?
Definition
spark-ec2
Term
What does the Spark script for launching an EC2 cluster also put on the nodes?
Definition
Ephemeral HDFS
Persistent HDFS
Tachyon
Ganglia
Term
What config property sets the application name?
Definition
spark.app.name
Term
What config property sets the master?
Definition
spark.master
Term
What file does Spark read properties from by default? How is it overridden?
Definition
SPARK_HOME/conf/spark-defaults.conf

spark-submit --properties-file
Term
What is the order of precedence for how Spark loads properties?
Definition
1. Properties set via user code (highest)
2. Flags passed to spark-submit
3. Properties file
4. Default properties (lowest)
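
Layered lookup of this kind can be modelled with `collections.ChainMap`, where earlier maps shadow later ones. The property values below are made up for illustration. The `submit_flags` layer reflects Spark's documented behaviour that flags passed to spark-submit rank below values set in code and above the properties file.

```python
from collections import ChainMap

# Hypothetical values for each configuration source.
defaults = {"spark.master": "local[*]", "spark.app.name": "unnamed"}
properties_file = {"spark.master": "spark://host:7077", "spark.executor.memory": "1g"}
submit_flags = {"spark.executor.memory": "512m"}
user_code = {"spark.app.name": "my-app"}

# Earlier maps win, so list the sources from highest to lowest priority.
effective = ChainMap(user_code, submit_flags, properties_file, defaults)

app_name = effective["spark.app.name"]                # set in user code
executor_memory = effective["spark.executor.memory"]  # flag overrides the file
master = effective["spark.master"]                    # file overrides the default
```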
Term
How would you provide Java options to executors using spark-submit?
Definition
spark-submit --conf "spark.executor.extraJavaOptions=..."
Term
How would you provide library paths to executors using spark-submit?
Definition
spark-submit --conf "spark.executor.extraLibraryPath=..."
Term
How would you specify the local storage directories for executors?
Definition
Standalone/Mesos: SPARK_LOCAL_DIRS environment variable, fallback to spark.local.dir property (csv)

YARN: LOCAL_DIRS environment variable or fallback spark.local.dir property (csv)
Term
Where are shuffle outputs written?
Definition
Disk
Term
What is "skew"?
Definition
When a small number of tasks take a large amount of time
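
One simple way to spot skew from task durations is to compare each task against the median. The helper below and its threshold factor of 3 are assumptions for illustration, not a Spark feature.

```python
from statistics import median


def skewed_tasks(durations, factor=3.0):
    """Return indices of tasks that ran longer than `factor` times the median."""
    typical = median(durations)
    return [i for i, d in enumerate(durations) if d > factor * typical]


# Hypothetical task durations in seconds for one stage.
stragglers = skewed_tasks([10, 11, 9, 10, 95])
```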
Term
Where are Spark's logs stored and accessed in Standalone and Mesos?
Definition
Standalone: stored in work/ dir on workers. Displayed in master's web interface

Mesos: Stored in work/ dir on mesos slaves, accessed via mesos master UI
Term
How would you view application logs in YARN?
Definition
yarn logs -applicationId ...
Term
How can you easily provide a log4j.properties to your Spark application?
Definition
spark-submit --files log4j.properties
Term
How do you specify to use the Kryo serializer?
Definition
Set the 'spark.serializer' property to 'org.apache.spark.serializer.KryoSerializer'
Term
What should you consider when using Kryo to serialize your custom classes?
Definition
Register those classes with Kryo to save space via:

conf.registerKryoClasses(Array(classOf[...], ...))
Term
What can help debug a NotSerializableException?
Definition
Set the java option "-Dsun.io.serialization.extendedDebugInfo=true"
Term
How is JVM memory distributed for executors by default? How is it changed?
Definition
Divided between:

- Persisted RDD storage, default 60%, set with spark.storage.memoryFraction
- Shuffle output, 20% soft limit, set with spark.shuffle.memoryFraction
- User code, anything leftover (default 20%)
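
As a back-of-the-envelope calculation, the default fractions above translate into the following shares of an executor heap. The helper is hypothetical and ignores any additional safety margins Spark may apply on top of these fractions.

```python
def executor_memory_split(heap_mb, storage_fraction=0.6, shuffle_fraction=0.2):
    """Split an executor heap into storage, shuffle, and user-code shares (in MB)."""
    storage = heap_mb * storage_fraction
    shuffle = heap_mb * shuffle_fraction
    user = heap_mb - storage - shuffle  # whatever is left over
    return {"storage": storage, "shuffle": shuffle, "user": user}


# A 4 GB executor heap with the default fractions.
split = executor_memory_split(4096)
```

With the defaults, roughly 2.4 GB of a 4 GB heap goes to persisted RDD storage and about 0.8 GB each to shuffle buffers and user code.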
Term
What is one reason you might cache serialized objects?
Definition
To reduce garbage collection times, which scale with the number of objects on the heap rather than the size of those objects.