Collection of few very interesting things that i have faced in my big data journey: Some Key points about Spark Shuffle Operation

What is Shuffle.?

The shuffle is Spark’s mechanism for re-distributing data so that is grouped differently across partitions. This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

Performance Impact

The Shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks - map tasks to organize the data, and a set of reduce tasks to aggregate it.

Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side and 'ByKey operations generate these on the reduce side. When data does not fit in memory Spark will spill these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.

Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are not cleaned up from Spark’s temporary storage until Spark is stopped, which means that long-running Spark jobs may consume available disk space. This is done so the shuffle doesn’t need to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.
Operations that involve Shuffle

repartition, coalesce, By Key Operations Except Count By Key, Join, Co Group

ReduceByKey vs. GroupByKey

Reduce By Key is more efficient than Group By Key, as Reduce involves grouping operations at each partition. A concept similar to Hadoop's Combiner

Word Count example to understand the difference between Reduce By Key and Group By Key.

Reduce By Key: The reduce operation happens for each partition, and then the results will be shuffled.

Group By Key: Each and every element gets suffled first, and then grouping operation happens.

Conclusion: Reduce By Key might not fit or solve all the Group By Operations, but wherever possible, it is better to opt for Reduce By Key, Fold By Key or Combine By Key, instead of Group By.

Collection of few very interesting things that i have faced in my big data journey

Thursday, 20 August 2015

Some Key points about Spark Shuffle Operation

Performance Impact

3 comments:

About Me