Thursday 20 August 2015

Learning Spark - Chapter 8 Notes

Configuration precedence for Spark

1) Highest priority: values set explicitly with set() on the SparkConf object in code
2) Next: flags passed on the spark-submit command line
3) Next: values in the properties file (conf/spark-defaults.conf by default)
4) Lowest priority: Spark's built-in default values
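
As a rough sketch of how the first three levels interact (the application name, jar name, and memory values below are illustrative, not from the book):

    // 1) Set in code via SparkConf -- wins over everything else
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ConfigDemo")                 // illustrative name
      .set("spark.executor.memory", "2g")       // takes precedence
    val sc = new SparkContext(conf)

    // 2) Passed on the spark-submit command line -- used if not set in code
    //    spark-submit --executor-memory 1g --class ConfigDemo demo.jar

    // 3) Read from the properties file (conf/spark-defaults.conf by default)
    //    spark.executor.memory   512m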




Configuration Property: spark.executor.memory (--executor-memory)
Default: 512m
Description: Amount of memory to use per executor process, in the same format as JVM memory strings (e.g., 512m, 2g).

Configuration Property: spark.executor.cores or spark.cores.max (--executor-cores, --total-executor-cores)
Default: 1
Description: Bounds the number of cores used by the application. In YARN mode, spark.executor.cores assigns a specific number of cores to each executor. In standalone and Mesos modes, spark.cores.max puts an upper bound on the total number of cores across all executors.

Configuration Property: spark.speculation
Default: false
Description: Setting this to true enables speculative execution of tasks: tasks that are running slowly have a second copy launched on another node. Enabling this can help cut down on straggler tasks in large clusters.

Configuration Property: spark.storage.blockManagerTimeoutIntervalMs
Default: 45000
Description: An internal timeout used for tracking the liveness of executors. For jobs with long garbage-collection pauses, raising this to 100 seconds (a value of 100000) or higher can prevent thrashing. In future versions of Spark this may be replaced with a general timeout setting, so check the current documentation.

Configuration Property: spark.executor.extraJavaOptions / spark.executor.extraClassPath / spark.executor.extraLibraryPath
Default: (empty)
Description: These three options customize the launch behavior of executor JVMs by adding extra Java options, classpath entries, or library path entries. Each is specified as a string (e.g., spark.executor.extraJavaOptions="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps"). Note that although this lets you manually augment the executor classpath, the recommended way to add dependencies is the --jars flag to spark-submit, not this option.

Configuration Property: spark.serializer
Default: org.apache.spark.serializer.JavaSerializer
Description: Class used for serializing objects that are sent over the network or cached in serialized form. The default Java serialization works with any serializable Java object but is quite slow, so org.apache.spark.serializer.KryoSerializer (with Kryo serialization configured) is recommended when speed is necessary. Can be any subclass of org.apache.spark.Serializer.

Configuration Property: spark.eventLog.enabled
Default: false
Description: Set to true to enable event logging, which allows completed Spark jobs to be viewed using a history server.

Configuration Property: spark.eventLog.dir
Default: file:///<file_path>
Description: The storage location used for event logging, if enabled. It can point to any filesystem, e.g., HDFS.
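
As an illustration, a few of these properties could be set programmatically on the SparkConf; the event-log directory below is just an assumed HDFS path, and the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("TuningDemo")                 // illustrative name
      // Faster serialization than the default JavaSerializer
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Keep event logs so finished jobs show up in the history server
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs")   // assumed path
      .set("spark.speculation", "true")
    val sc = new SparkContext(conf)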





SPARK_LOCAL_DIRS     Local directory (or comma-separated list of directories) that Spark should use for shuffle data. This property cannot be set through the SparkConf configuration object; it is exported as an environment variable, typically in conf/spark-env.sh.
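
For example, it might be exported in conf/spark-env.sh (the mount points below are only illustrative):

    # conf/spark-env.sh
    # Spread shuffle files across multiple local disks (example paths)
    export SPARK_LOCAL_DIRS=/mnt/disk1/spark,/mnt/disk2/spark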

Spark's default cache() operation persists an RDD using the MEMORY_ONLY storage level.

MEMORY_AND_DISK persists RDD partitions in memory and, when memory is not sufficient, spills the partitions that do not fit to disk, reading them back from disk when they are needed.
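
A small sketch of the two storage levels on an RDD (the input path is a placeholder):

    import org.apache.spark.storage.StorageLevel

    val lines = sc.textFile("hdfs:///data/input.txt")     // placeholder path

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
    lines.cache()

    // Ask for memory first, spilling partitions that do not fit to disk
    val words = lines.flatMap(_.split(" "))
    words.persist(StorageLevel.MEMORY_AND_DISK)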

