Thursday 20 August 2015

Learning Spark - Chapter 8 Notes

Configuration precedence for Spark

1) Highest priority: values set explicitly with set() on the SparkConf object in code
2) Next: flags passed on the spark-submit command line
3) Next: values in the properties file (conf/spark-defaults.conf by default)
4) Lowest priority: Spark's built-in default values
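
As a rough sketch of how the first three levels interact (the application name, jar name, and memory values below are illustrative, not from the book):

    // 1) Set in code via SparkConf -- wins over everything else
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ConfigDemo")                 // illustrative name
      .set("spark.executor.memory", "2g")       // takes precedence
    val sc = new SparkContext(conf)

    // 2) Passed on the spark-submit command line -- used if not set in code
    //    spark-submit --executor-memory 1g --class ConfigDemo demo.jar

    // 3) Read from the properties file (conf/spark-defaults.conf by default)
    //    spark.executor.memory   512m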




Configuration Property: spark.executor.memory (--executor-memory)
Default: 512m
Description: Amount of memory to use per executor process, in the same format as JVM memory strings (e.g., 512m, 2g).

Configuration Property: spark.executor.cores or spark.cores.max (--executor-cores, --total-executor-cores)
Default: 1
Description: Bounds the number of cores used by the application. In YARN mode, spark.executor.cores assigns a specific number of cores to each executor. In standalone and Mesos modes, spark.cores.max puts an upper bound on the total number of cores across all executors.

Configuration Property: spark.speculation
Default: false
Description: Setting this to true enables speculative execution of tasks: tasks that are running slowly have a second copy launched on another node. Enabling this can help cut down on straggler tasks in large clusters.

Configuration Property: spark.storage.blockManagerTimeoutIntervalMs
Default: 45000
Description: An internal timeout used for tracking the liveness of executors. For jobs with long garbage-collection pauses, raising this to 100 seconds (a value of 100000) or higher can prevent thrashing. In future versions of Spark this may be replaced with a general timeout setting, so check the current documentation.

Configuration Property: spark.executor.extraJavaOptions / spark.executor.extraClassPath / spark.executor.extraLibraryPath
Default: (empty)
Description: These three options customize the launch behavior of executor JVMs by adding extra Java options, classpath entries, or library path entries. Each is specified as a string (e.g., spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"). Note that although this lets you manually augment the executor classpath, the recommended way to add dependencies is the --jars flag to spark-submit, not this option.

Configuration Property: spark.serializer
Default: org.apache.spark.serializer.JavaSerializer
Description: Class used for serializing objects that are sent over the network or cached in serialized form. The default Java serialization works with any serializable Java object but is quite slow, so org.apache.spark.serializer.KryoSerializer (with Kryo serialization configured) is recommended when speed is necessary. Can be any subclass of org.apache.spark.Serializer.

Configuration Property: spark.eventLog.enabled
Default: false
Description: Set to true to enable event logging, which allows completed Spark jobs to be viewed using a history server.

Configuration Property: spark.eventLog.dir
Default: file:///<file_path>
Description: The storage location used for event logging, if enabled. It can point to any filesystem, e.g., HDFS.
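
As an illustration, a few of these properties could be set programmatically on the SparkConf; the event-log directory below is just an assumed HDFS path, and the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("TuningDemo")                 // illustrative name
      // Faster serialization than the default JavaSerializer
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Keep event logs so finished jobs show up in the history server
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs")   // assumed path
      .set("spark.speculation", "true")
    val sc = new SparkContext(conf)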





SPARK_LOCAL_DIRS     Local directory (or comma-separated list of directories) that Spark should use for shuffle data. This property cannot be set through the SparkConf configuration object; it is exported as an environment variable, typically in conf/spark-env.sh.
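
For example, it might be exported in conf/spark-env.sh (the mount points below are only illustrative):

    # conf/spark-env.sh
    # Spread shuffle files across multiple local disks (example paths)
    export SPARK_LOCAL_DIRS=/mnt/disk1/spark,/mnt/disk2/spark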

Spark's default cache() operation persists an RDD using the MEMORY_ONLY storage level.

MEMORY_AND_DISK persists RDD partitions in memory and, when memory is not sufficient, spills the partitions that do not fit to disk, reading them back from disk when they are needed.
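
A small sketch of the two storage levels on an RDD (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input.txt")     // placeholder path

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    lines.cache()

    // Ask for memory first, spilling partitions that do not fit to disk
    val words = lines.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK)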

