Thursday, 20 August 2015

Some Key points about Spark Shuffle Operation

What is Shuffle?

The shuffle is Spark’s mechanism for re-distributing data so that it is grouped differently across partitions. This typically involves copying data across executors and machines, which makes the shuffle a complex and costly operation.
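To make the idea concrete, here is a minimal plain-Python sketch (not Spark code) of what re-distributing records by key looks like. It assumes a hash-style partitioner that sends each record to the partition given by the key's hash modulo the number of partitions; `zlib.crc32` is only a deterministic stand-in for the key hash, and the function names and sample data are hypothetical.

```python
import zlib

def target_partition(key, num_partitions):
    # Deterministic stand-in for a key hash (assumption: not Spark's actual hash function).
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def shuffle(input_partitions, num_partitions):
    # Each input partition plays the role of one map task's output;
    # each output partition is what one reduce task would read.
    output = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            output[target_partition(key, num_partitions)].append((key, value))
    return output

# Hypothetical data: the same keys are scattered over two input partitions.
input_partitions = [
    [("spark", 1), ("shuffle", 1), ("spark", 1)],
    [("shuffle", 1), ("spark", 1), ("disk", 1)],
]
shuffled = shuffle(input_partitions, num_partitions=2)

for index, part in enumerate(shuffled):
    print(index, part)
```

After the shuffle, all records with the same key sit in the same output partition, no matter which input partition they started in. In a real cluster, each of those moves can mean serializing a record, writing it to disk, and sending it over the network.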


Performance Impact

  • The shuffle is an expensive operation since it involves disk I/O, data serialization, and network I/O. To organize data for the shuffle, Spark generates sets of tasks: map tasks to organize the data, and a set of reduce tasks to aggregate it.
  • Certain shuffle operations can consume significant amounts of heap memory since they employ in-memory data structures to organize records before or after transferring them. Specifically, reduceByKey and aggregateByKey create these structures on the map side, and 'ByKey operations generate them on the reduce side. When the data does not fit in memory, Spark spills these tables to disk, incurring the additional overhead of disk I/O and increased garbage collection.
  • The shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are not cleaned up from Spark’s temporary storage until Spark is stopped, which means that long-running Spark jobs may consume a lot of the available disk space. The files are kept so that the shuffle doesn’t need to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the spark.local.dir configuration parameter when configuring the Spark context.
  • Operations that involve a shuffle:
                      repartition, coalesce, 'ByKey operations (except countByKey), join, cogroup

  
ReduceByKey vs. GroupByKey

reduceByKey is more efficient than groupByKey because it combines the values for each key within each partition before the shuffle, so less data has to be moved. The concept is similar to Hadoop's Combiner.


A word count example helps to show the difference between reduceByKey and groupByKey.



reduceByKey: The reduce operation first runs within each partition, and only the partial results are shuffled and then reduced again into the final counts.



groupByKey: Every element is shuffled first, and the grouping operation happens only afterwards.
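The difference can be sketched in plain Python by counting how many records would cross the shuffle boundary in each case. This is a simulation under simple assumptions, not Spark's implementation: the two hard-coded lists stand in for partitions, and the function names `reduce_by_key_style` and `group_by_key_style` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: words already split across two partitions.
partitions = [
    ["a", "b", "a", "a"],
    ["b", "a", "b", "c"],
]

def reduce_by_key_style(partitions):
    # Combine within each partition first, then shuffle one record per key per partition.
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for word in part:
            local[word] += 1
        shuffled.extend(local.items())
    # Reduce side: merge the partial counts.
    totals = defaultdict(int)
    for word, count in shuffled:
        totals[word] += count
    return dict(totals), len(shuffled)

def group_by_key_style(partitions):
    # Shuffle every (word, 1) pair unchanged, then group and sum on the reduce side.
    shuffled = [(word, 1) for part in partitions for word in part]
    groups = defaultdict(list)
    for word, one in shuffled:
        groups[word].append(one)
    totals = {word: sum(ones) for word, ones in groups.items()}
    return totals, len(shuffled)

reduce_counts, reduce_shuffled = reduce_by_key_style(partitions)
group_counts, group_shuffled = group_by_key_style(partitions)
print(reduce_counts == group_counts)    # both strategies give the same word counts
print(reduce_shuffled, group_shuffled)  # number of records crossing the shuffle
```

With this input, both approaches produce the same counts, but the reduceByKey-style version shuffles 5 records while the groupByKey-style version shuffles all 8. The gap grows with the number of repeated keys per partition.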

Conclusion: reduceByKey cannot replace every groupByKey use case, but wherever possible it is better to opt for reduceByKey, foldByKey, or combineByKey instead of groupByKey.
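As an illustration of a grouping-style computation that does not need groupByKey, here is a plain-Python sketch of a per-key average built from three functions, mirroring the createCombiner, mergeValue, and mergeCombiners arguments that Spark's combineByKey takes. The `combine_by_key` helper below is a hypothetical imitation written for this post, not Spark's API, and the sample records are invented.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Map side: build one combiner per key inside each partition.
    per_partition = []
    for part in partitions:
        local = {}
        for key, value in part:
            if key in local:
                local[key] = merge_value(local[key], value)
            else:
                local[key] = create_combiner(value)
        per_partition.append(local)
    # Reduce side: merge the per-partition combiners for each key.
    merged = {}
    for local in per_partition:
        for key, combiner in local.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged

# Hypothetical (key, score) records in two partitions.
partitions = [
    [("x", 2.0), ("y", 4.0), ("x", 4.0)],
    [("x", 6.0), ("y", 8.0)],
]

sums_and_counts = combine_by_key(
    partitions,
    create_combiner=lambda v: (v, 1),                         # first value seen for a key
    merge_value=lambda acc, v: (acc[0] + v, acc[1] + 1),      # fold a value into a combiner
    merge_combiners=lambda a, b: (a[0] + b[0], a[1] + b[1]),  # merge across partitions
)
averages = {key: total / count for key, (total, count) in sums_and_counts.items()}
print(averages)
```

Only one (sum, count) pair per key per partition would need to be shuffled here, instead of every individual score, which is the same saving that reduceByKey gives for word count.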
