Term
| What cluster managers does Spark support? |
|
Definition
| The Standalone cluster manager, Hadoop YARN, and Apache Mesos. |
|
|
Term
| What storage systems does Spark support? |
|
Definition
| Any storage system supported by the Hadoop APIs. |
|
|
Term
| Local mode is also known as |
|
Definition
|
|
Term
| How do you change the logging level of Spark? |
|
Definition
| via the conf/log4j.properties file |
|
|
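For example, one might copy conf/log4j.properties.template to conf/log4j.properties and edit the root category (the WARN level here is illustrative):

  log4j.rootCategory=WARN, console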
Term
| How do you execute a standalone Spark application? |
|
Definition
| With the bin/spark-submit script. |
|
|
Term
| How do you set the Spark application's name? |
|
Definition
Through:
- SparkConf::setAppName("...")
- spark-submit --name "..." |
|
|
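A minimal Scala sketch of the programmatic route ("My App" is a hypothetical name):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("My App")  // the name appears in the UI and logs
  val sc = new SparkContext(conf)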
Term
| Can you run a Spark application without setting any SparkConf properties? |
|
Definition
| Yes. Every property has a reasonable default value. |
|
|
Term
| What pseudocode creates an accumulator with initial value of 1 and adds 2 to it? |
|
Definition
sc.accumulator(1).add(2) |
|
|
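A runnable Scala version of that pseudocode, assuming an existing SparkContext named sc:

  val acc = sc.accumulator(1)                      // initial value 1
  sc.parallelize(1 to 10).foreach(x => acc += 2)   // += is equivalent to add(2), applied inside an action
  println(acc.value)                               // read the result back on the driver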
Term
| What method is used to obtain an accumulator's value? |
|
Definition
| Its value method (e.g., accumulator.value), called on the driver. |
|
|
Term
| What guarantees does Spark provide with respect to applying updates to accumulators? |
|
Definition
| Updates to accumulators in actions are only applied once. There is no guarantee for updates in transformations. |
|
|
Term
| If your application has high network traffic, what might you adjust? |
|
Definition
| The partitioning of your RDDs (e.g., with partitionBy()), so that key-oriented operations shuffle less data. |
|
|
Term
| When can partitioning provide performance benefits? |
|
Definition
| When a cached dataset would otherwise be shuffled by key multiple times, as in key-oriented applications such as joins. |
|
|
Term
| On what is partitioning available? |
|
Definition
| All RDDs of key/value pairs. |
|
|
Term
| How can partitioning provide performance benefits? |
|
Definition
| By ensuring a static dataset is only hashed once, assuming that dataset is cached and reused. |
|
|
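A Scala sketch of the hash-once pattern, assuming an existing SparkContext sc; the input file and partition count are illustrative:

  import org.apache.spark.HashPartitioner

  val pairs = sc.textFile("users.csv")
    .map(line => (line.split(",")(0), line))                  // key by user ID
  val partitioned = pairs.partitionBy(new HashPartitioner(100)).persist()
  // Later joins against `partitioned` reuse its partitioning instead of rehashing it.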
Term
| What is the number of partitions an upper bound for? |
|
Definition
| The number of tasks / degree of parallelism |
|
|
Term
| What is a good guideline for the number of partitions you should have for an RDD? |
|
Definition
| At least as large as the number of cores in your cluster. |
|
|
Term
| If an operation modifies a single RDD and that RDD is partitioned and cached, what data is transferred over the network? |
|
Definition
| Only the output of the operation. The input is operated upon locally. |
|
|
Term
| What partitioner will be selected for an operation on two RDDs that sets a partitioner? |
|
Definition
It depends on whether those RDDs have partitioners set:
1. If neither parent RDD has a partitioner, a hash partitioner is used with the level of parallelism defined by the operation
2. If only one parent RDD has a partitioner, that partitioner is used
3. If both parent RDDs have a partitioner, the partitioner of the first parent is used. |
|
|
Term
| In local mode, how many processes does a Spark application have? |
|
Definition
| One - the driver and a single executor run in the same process. |
|
|
Term
| In distributed mode, how many processes does a Spark application have? |
|
Definition
| One for the driver, and one for each executor. |
|
|
Term
| On encountering an action, what does Spark's scheduler do? |
|
Definition
| It creates a physical execution plan, working backward from the final RDD being computed. |
|
|
Term
| What is a stage? |
|
Definition
| A group of one or more transformations/actions that is executed as N tasks, where N is the number of partitions. |
|
|
Term
| What is pipelining? |
|
Definition
| When multiple transformations/actions are combined into a single stage |
|
|
Term
| In the simplest case, how many stages will a Spark application have? |
|
Definition
| One for each transformation and action |
|
|
Term
| When is pipelining performed? |
|
Definition
| When an RDD can be computed from its parents without any data movement |
|
|
Term
| When is data shuffling avoided? |
|
Definition
| When the necessary shuffle output is available in a persisted RDD or is still on disk from an earlier computation. |
|
|
Term
| In what order are Spark stages executed? |
|
Definition
| In the order defined by RDD lineage |
|
|
Term
| How does Spark handle the loss of a persisted RDD? |
|
Definition
| It determines what is necessary to calculate that RDD through the lineage graph, and then recalculates it. |
|
|
Term
| Why shouldn't we use collect() on large datasets? |
|
Definition
| Because the entire dataset will have to fit in the Driver program's memory, which may not be feasible. |
|
|
Term
| What can you do to mitigate slowdowns due to lost persisted RDDs? |
|
Definition
| Replicate persisted data to multiple nodes |
|
|
Term
| What persistence options are available? |
|
Definition
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
OFF_HEAP |
|
|
Term
| What happens if you try to persist more data than you have memory for? |
|
Definition
| Spark evicts partitions using an LRU policy. With a MEMORY_AND_DISK level, evicted partitions spill to disk; otherwise they are simply recomputed the next time they are needed. |
|
|
Term
| How do you remove data from the cache? |
|
Definition
| Call the .unpersist() method on the RDD |
|
|
Term
| What can hinder performance when using broadcast variables? |
|
Definition
| Serializing large broadcast values; the default Java serialization can be slow, so consider an alternative such as Kryo. |
|
|
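A Scala sketch, assuming an existing SparkContext sc and a small hypothetical lookup table:

  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))         // shipped to each node once
  val counts = sc.parallelize(Seq("a", "b", "c"))
    .map(k => lookup.value.getOrElse(k, 0))                  // read-only access on executors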
Term
| Where can external program scripts be loaded from? |
|
Definition
| Local file system, any Hadoop supported file system, HTTP, HTTPS, or FTP |
|
|
Term
| How can we set environment variables for external scripts? |
|
Definition
| Pass them as a map in the second argument of the pipe() command. |
|
|
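A Scala sketch, assuming an existing SparkContext sc and a hypothetical script ./myscript.sh:

  val env = Map("MY_VAR" -> "some-value")                    // exported into the script's environment
  val piped = sc.parallelize(Seq("hello", "world"))
    .pipe(Seq("./myscript.sh"), env)                         // the env map is pipe()'s second argument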
Term
| When are object files commonly used? |
|
Definition
| To save Spark job data to be used by other code/jobs |
|
|
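A Scala sketch, assuming an existing SparkContext sc; the output directory is illustrative:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
  pairs.saveAsObjectFile("out/pairs")                        // writes Java-serialized objects
  val restored = sc.objectFile[(String, Int)]("out/pairs")   // read back by a later job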
Term
| What is a danger of using object files? |
|
Definition
| It requires programmer effort to maintain backwards compatibility when changing serialized classes |
|
|
Term
| What is a caveat when serializing an RDD of Writable objects? |
|
Definition
| Many Writable objects are not serializable. You may need to use a map function to unwrap them before serializing. |
|
|
Term
| What is a caveat when caching an RDD of Writable objects? |
|
Definition
| Caching can fail because Hadoop's RecordReader reuses the same Writable instance for every record, so the cache can end up holding many references to one object. Apply a map() that copies the values before caching. |
|
|
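A Scala sketch of the copy-before-caching pattern, assuming an existing SparkContext sc and a hypothetical SequenceFile of Text/IntWritable pairs:

  import org.apache.hadoop.io.{IntWritable, Text}

  val data = sc.sequenceFile("in/counts", classOf[Text], classOf[IntWritable])
    .map { case (k, v) => (k.toString, v.get) }              // copy values out of the reused Writables
    .cache()                                                 // safe to cache the copies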
Term
| What type of Hadoop formats generally support compression? |
|
Definition
|
|
Term
| What happens if Spark reads data from a source with an unsplittable compression? |
|
Definition
| Spark reads all the data on a single node. |
|
|
Term
| How does textFile handle sources with splittable compression? |
|
Definition
| textFile ignores splittability altogether |
|
|
Term
| If you want to read a text file with a splittable compression, what should you do? |
|
Definition
| Use the Hadoop API commands directly and specify the codec |
|
|
Term
| How does Spark handle reading from local file systems? |
|
Definition
| All files must be at the same path on all cluster nodes |
|
|
Term
| What is a URI Spark would recognize for an S3 object? |
|
Definition
| s3n://bucket/path-within-bucket |
|
|
Term
| What is a URI Spark would recognize for an HDFS object? |
|
Definition
| hdfs://master:port/path |
|
|
Term
| What happens after a user submits an application with spark-submit? |
|
Definition
1. spark-submit launches the driver program and invokes its main() method
2. The driver program asks the cluster manager for resources to launch executors
3. The cluster manager launches executors
4. The driver process runs, sending tasks to executors based on the application's transformations/actions
5. Executors run tasks and save/transmit results
6. On exit, the executors are terminated and the cluster manager's resources are released |
|
|
Term
| When are executors terminated and cluster manager resources released by a Spark application? |
|
Definition
| When the driver's main() method exits or when SparkContext.stop() is called |
|
|
Term
In what mode does the following execute?
spark-submit my_script.py |
|
Definition
| Local mode, since no --master flag is given. |
|
|
Term
| How would you run myjob.jar locally with 2 cores? |
|
Definition
| spark-submit --master local[2] myjob.jar |
|
|
Term
| How would you run myjob.jar locally with the maximum number of cores? |
|
Definition
| spark-submit --master local[*] myjob.jar |
|
|
Term
| How would you run myjob.jar on a YARN cluster? |
|
Definition
spark-submit --master yarn myjob.jar
Additionally, set the HADOOP_CONF_DIR environment variable to the location of your Hadoop configuration directory |
|
|
Term
| How would you run myjob.jar on a Mesos cluster? |
|
Definition
| spark-submit --master mesos://host:port myjob.jar |
|
|
Term
| How would you run myjob.jar on a standalone cluster? |
|
Definition
| spark-submit --master spark://host:port myjob.jar |
|
|
Term
| Where does the driver program run when spark-submit is executed? |
|
Definition
By default, it will run on the machine where spark-submit is executed (client mode). To instead run it on one of the worker nodes, use:
--deploy-mode cluster |
|
|
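Putting the common flags together, an illustrative submission (the flag values and myjob.jar are hypothetical):

  spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --name "My App" \
    --executor-memory 512m \
    myjob.jar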
Term
| How do you set the name of your application via spark-submit? |
|
Definition
| spark-submit --name "..." |
|
|
Term
| How do you put JAR files on the classpath and transmit those JARs to the cluster nodes? |
|
Definition
| spark-submit --jars jar,jar,... |
|
|
Term
| How do you put non-jar files in the working directory of your application for each cluster node? |
|
Definition
| spark-submit --files file,file,... |
|
|
Term
| How do you put python files on the PYTHONPATH of the application? |
|
Definition
| spark-submit --py-files file.py,file.egg,file.zip,... |
|
|
Term
| How do you specify 512 megabytes of executor memory for your application? |
|
Definition
| spark-submit --executor-memory 512m |
|
|
Term
| How do you specify 5 gigabytes of driver memory for your application? |
|
Definition
| spark-submit --driver-memory 5g |
|
|
Term
| How can you provide arbitrary configuration properties via spark-submit? |
|
Definition
| spark-submit --conf prop=value --conf prop=value ... |
|
|
Term
| How can you provide a properties file via spark-submit? |
|
Definition
| spark-submit --properties-file ... |
|
|
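An illustrative properties file and its use (my.conf and the values are hypothetical):

  # my.conf
  spark.master           spark://host:7077
  spark.app.name         MyApp
  spark.executor.memory  512m

  spark-submit --properties-file my.conf myjob.jar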
Term
| If you have many library dependencies, what is an alternative to providing them via spark-submit? |
|
Definition
| Create a fat jar containing all dependencies, typically using a build tool. |
|
|
Term
| What shouldn't you include as a dependency in a fat jar? |
|
Definition
| Spark itself; mark it as a provided dependency, since it is already installed on the cluster. |
|
|
Term
| What primarily governs resource sharing between (inter) Spark applications? |
|
Definition
| The cluster manager. |
|
|
Term
| If a Spark application asks for 5 executors, how many is it guaranteed to get? |
|
Definition
| No guarantee. It may receive fewer, or more, depending on availability and contention in the cluster. |
|
|
Term
| What governs resource sharing within a long lived Spark application (intra)? |
|
Definition
| Spark's internal scheduler |
|
|
Term
| What does Spark's Fair Scheduler provide? |
|
Definition
| Applications can define pools (priority queues) for scheduling their jobs |
|
|
Term
| How do most cluster managers handle scheduling between jobs? |
|
Definition
| By defining priority queues and/or capacity limits for jobs |
|
|
Term
| When is the standalone cluster manager appropriate? |
|
Definition
| When you only want Spark to run on the cluster. |
|
|
Term
| What are the steps to stand up a standalone cluster? |
|
Definition
1. Put Spark at the same location on all cluster nodes
2. Enable password-less SSH access between the cluster nodes
3. Add the worker hostnames to the master's conf/slaves
4. Run sbin/start-all.sh on the master |
|
|
Term
| On what port does a standalone cluster run by default? |
|
Definition
| Port 7077 for the master (with a web UI on port 8080). |
|
|
Term
| How does Spark handle a request for more memory than an executor node has available? |
|
Definition
| It will not launch executors on that node. |
|
|
Term
| If you have a *standalone*, 20-node cluster with 4-core machines, how can you limit your job to running on eight machines? |
|
Definition
| spark-submit --total-executor-cores 8 |
|
|
Term
| For multiple applications how many executors will run on a single node in a *standalone* cluster? |
|
Definition
| By default no more than 1 per application |
|
|
Term
| If you have a *standalone* 20-node cluster with 4-core machines, how can you have 8 executors running on as few nodes as possible? |
|
Definition
| Set spark.deploy.spreadOut to false |
|
|
Term
| Can you have multiple masters with a *standalone* cluster? |
|
Definition
| Yes. Using ZooKeeper, standby masters can be configured to take over if the active master fails. |
|
|
Term
| By default, how many executors are used for a YARN application? |
|
Definition
| Two |
|
|
Term
| How can you change the number of executors an application will launch? |
|
Definition
| spark-submit --num-executors ... |
|
|
Term
| How do you set the number of cores each executor will use in a YARN cluster? |
|
Definition
| spark-submit --executor-cores ... |
|
|
Term
| How can you submit a Spark application to a specific YARN cluster queue? |
|
Definition
| spark-submit --queue queuename |
|
|
Term
| How can you use Zookeeper to elect a master node in a Mesos cluster? |
|
Definition
| spark-submit --master mesos://zk://node1:port/mesos,node2:port/mesos,... |
|
|
Term
| What modes does Mesos offer for scheduling and which is the default? |
|
Definition
Fine-grained (the default)
Coarse-grained (set via spark.mesos.coarse=true) |
|
|
Term
| What is Mesos' fine-grained mode? |
|
Definition
| Dynamically scales the number of CPUs executors claim to share cluster resources among multiple jobs as they come and go. |
|
|
Term
| When wouldn't you want to use Mesos' fine-grained mode? |
|
Definition
| For applications with high latency sensitivity (e.g., Spark Streaming) |
|
|
Term
| What is Mesos' coarse-grained mode? |
|
Definition
| Spark allocates a fixed number of CPUs to each executor which are not released until the application ends. |
|
|
Term
| At what level is Mesos' scheduling mode set (per-cluster or per-job)? |
|
Definition
| Per-job, since spark.mesos.coarse is set in the application's own configuration. |
|
|
Term
| How many cores will Mesos use in the cluster by default? |
|
Definition
| All available cores in the cluster. |
|
|
Term
| How can you set a limit for the number of cores Mesos will use? |
|
Definition
| spark-submit --total-executor-cores ... |
|
|
Term
| How can you start a standalone cluster in EC2? |
|
Definition
| With the spark-ec2 script bundled with Spark (e.g., ./spark-ec2 launch clustername). |
|
|
Term
| What does the Spark script for launching a EC2 cluster also put on the nodes? |
|
Definition
Ephemeral HDFS
Persistent HDFS
Tachyon
Ganglia |
|
|
Term
| What config property sets the application name? |
|
Definition
| spark.app.name |
|
|
Term
| What config property sets the master? |
|
Definition
| spark.master |
|
|
Term
| What file does Spark read properties from by default? How is it overridden? |
|
Definition
SPARK_HOME/conf/spark-defaults.conf
spark-submit --properties-file |
|
|
Term
| What is the order of precedence for how Spark loads properties? |
|
Definition
1. Properties set via user code (highest)
2. Flags passed to spark-submit
3. Values in the properties file
4. Default properties (lowest) |
|
|
Term
| How would you provide Java options to executors using spark-submit? |
|
Definition
| spark-submit --conf "spark.executor.extraJavaOptions=..." |
|
|
Term
| How would you provide library paths to executors using spark-submit? |
|
Definition
| spark-submit --conf "spark.executor.extraLibraryPath=..." |
|
|
Term
| How would you specify the local storage directories for executors? |
|
Definition
Standalone/Mesos: the SPARK_LOCAL_DIRS environment variable, falling back to the spark.local.dir property (a comma-separated list of directories)
YARN: the LOCAL_DIRS environment variable, falling back to the spark.local.dir property (comma-separated) |
|
|
Term
| Where are shuffle outputs written? |
|
Definition
| To the executors' local storage directories (SPARK_LOCAL_DIRS / spark.local.dir). |
|
|
Term
| What is skew? |
|
Definition
| When a small number of tasks take a large amount of time |
|
|
Term
| Where are Spark's logs stored and accessed in Standalone and Mesos? |
|
Definition
Standalone: stored in work/ dir on workers. Displayed in master's web interface
Mesos: Stored in work/ dir on mesos slaves, accessed via mesos master UI |
|
|
Term
| How would you view application logs in YARN? |
|
Definition
| yarn logs -applicationId ... |
|
|
Term
| How can you easily provide a log4j.properties file to your Spark application? |
|
Definition
| spark-submit --files log4j.properties |
|
|
Term
| How do you specify to use the Kryo serializer? |
|
Definition
| Set the 'spark.serializer' property to 'org.apache.spark.serializer.KryoSerializer' |
|
|
Term
| What should you consider when using Kryo to serialize your custom classes? |
|
Definition
Register those classes with Kryo to save space via:
conf.registerKryoClasses(Array(classOf[...], ...)) |
|
|
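A Scala sketch combining both steps (MyClass and MyOtherClass are hypothetical user classes):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
  conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  conf.registerKryoClasses(Array(classOf[MyClass], classOf[MyOtherClass]))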
Term
| What can help debug a NotSerializableException? |
|
Definition
| Set the java option "-Dsun.io.serialization.extendedDebugInfo=true" |
|
|
Term
| How is JVM memory distributed for executors by default? How is it changed? |
|
Definition
Divided between:
- Persisted RDD storage: default 60%, set with spark.storage.memoryFraction
- Shuffle output: 20% soft limit, set with spark.shuffle.memoryFraction
- User code: anything left over (default 20%) |
|
|
Term
| What is one reason you might cache serialized objects? |
|
Definition
| To reduce garbage collection time, which scales with the number of objects on the heap rather than their size. |
|
|