Configuration precedence for Spark
1) 1st priority: configuration set with set() on the SparkConf in code
2) 2nd priority: command-line arguments passed to spark-submit
3) 3rd priority: entries in the properties file (conf/spark-defaults.conf)
4) 4th priority: Spark's built-in default properties
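For example, a key set directly on the SparkConf in application code wins over the same key supplied on the command line or in spark-defaults.conf. A minimal Scala sketch, assuming an illustrative app name, master URL, and jar name:

import org.apache.spark.{SparkConf, SparkContext}

// 1) Highest priority: values set programmatically on the SparkConf.
val conf = new SparkConf()
  .setAppName("config-precedence-demo")   // illustrative app name
  .setMaster("local[2]")                  // illustrative master for a local run
  .set("spark.executor.memory", "2g")     // overrides the same key from the layers below

val sc = new SparkContext(conf)

// 2) Next, flags passed on the command line, e.g.:
//      spark-submit --executor-memory 1g --class Demo demo.jar
// 3) Then entries in the properties file (conf/spark-defaults.conf by default):
//      spark.executor.memory  512m
// 4) Finally, Spark's built-in defaults apply when nothing else sets the key.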
SPARK_LOCAL_DIRS: the local directory (or comma-separated list of directories) that Spark should use for shuffle data. This property cannot be set using the configuration object; it is an environment variable, typically exported in conf/spark-env.sh.
Spark's default cache() operation persists data using the MEMORY_ONLY storage level.
MEMORY_AND_DISK persists RDDs to memory and, when memory is not sufficient, spills the partitions that do not fit to disk, reading them back from disk when they are needed.
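A short Scala sketch of the two storage levels mentioned above, reusing the SparkContext sc from the earlier sketch; the input path is an assumption:

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
val lines = sc.textFile("hdfs:///data/input.txt")   // assumed input path
lines.cache()

// MEMORY_AND_DISK keeps partitions in memory and spills those that do not fit
// to local disk (under the directories given by SPARK_LOCAL_DIRS, which is an
// environment variable, e.g. exported in conf/spark-env.sh; it cannot be set
// through SparkConf).
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)

println(words.count())   // the first action materializes and caches the data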
Configuration Property | Default Value | Description
spark.executor.memory (--executor-memory) | 512m | Amount of memory to use per executor process, in the same format as JVM memory strings (e.g., 512m, 2g).
spark.executor.cores (--executor-cores, --total-executor-cores) | 1 | Configurations for bounding the number of cores used by the application. In YARN mode spark.executor.cores assigns a specific number of cores to each executor. In standalone and Mesos modes, you can upper-bound the total number of cores across all executors using spark.cores.max.
spark.speculation | false | Setting this to true enables speculative execution of tasks: tasks that are running slowly will have a second copy launched on another node. Enabling this can help cut down on straggler tasks in large clusters.
spark.storage.blockManagerTimeoutIntervalMs | 45000 | An internal timeout used for tracking the liveness of executors. For jobs that have long garbage collection pauses, tuning this to 100 seconds (a value of 100000) or higher can prevent thrashing. In future versions of Spark this may be replaced with a general timeout setting, so check the current documentation.
spark.executor.extraJavaOptions, spark.executor.extraClassPath, spark.executor.extraLibraryPath | | These three options allow you to customize the launch behavior of executor JVMs. The three flags add extra Java options, classpath entries, or entries for the JVM library path. They should be specified as strings (e.g., spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"). Note that while this allows you to manually augment the executor classpath, the recommended way to add dependencies is through the --jars flag to spark-submit, not this option.
spark.serializer | org.apache.spark.serializer.JavaSerializer | Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. The default Java serialization works with any serializable Java object but is quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization when speed is necessary. Can be any subclass of org.apache.spark.Serializer.
spark.eventLog.enabled | false | Set to true to enable event logging, which allows completed Spark jobs to be viewed using a history server.
spark.eventLog.dir | file:///<file_path> | The storage location used for event logging, if enabled. This can be set to any file system, e.g. HDFS.
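To illustrate how several of the properties in the table fit together, here is a Scala sketch that sets executor memory and cores, speculation, Kryo serialization, and event logging on a SparkConf; the application name, registered class, and log directory are assumptions, not values from the table:

import org.apache.spark.{SparkConf, SparkContext}

case class Record(id: Int, value: String)   // hypothetical class to register with Kryo

val conf = new SparkConf()
  .setAppName("tuning-demo")                               // illustrative app name
  .set("spark.executor.memory", "2g")                      // per-executor heap
  .set("spark.executor.cores", "2")                        // cores per executor (YARN)
  .set("spark.speculation", "true")                        // re-launch straggler tasks
  .set("spark.serializer",
       "org.apache.spark.serializer.KryoSerializer")       // faster than Java serialization
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///spark-logs")         // assumed log directory

// Registering classes with Kryo gives a more compact serialized form.
conf.registerKryoClasses(Array(classOf[Record]))

val sc = new SparkContext(conf)

The same keys can equally be supplied as --conf key=value pairs or as the dedicated spark-submit flags shown in the table, in which case the in-code set() calls above take precedence over them.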