Wednesday 26 August 2015

Return Types for all important RDD and Pair RDD Functions with Examples

I have collected some important RDD Functions and collected their return types for reference. 

They are as follows:

Taking two Sequences as inputs:

rdd1 = {"sr","sr2","sr3","sr4"}
rdd2 = {"cs1","cs2","cs3","cs4"} 

cartesian = RDD[(T, U)] - Pairs with each element against all the elements of other RDD

eg: rdd1.cartesian(rdd2)
result: RDD[(String, String)]

(sr,cs3)
(sr,cs4)
(sr,cs1)
(sr,cs2)
(sr2,cs1)
(sr2,cs2)
(sr2,cs3)
(sr2,cs4)
(sr3,cs1)
(sr3,cs2)
(sr4,cs1)
(sr4,cs2)
(sr3,cs3)
(sr3,cs4)
(sr4,cs3)
(sr4,cs4)


collect = Array[T]

eg: rdd1.collect()
result: Array(sr, sr2, sr3, sr4)

count = Long

rdd1.count()
Long = 4


countApprox = PartialResult[BoundedDouble]

rdd1.countApprox(1,0.95)
(final: [4.000, 4.000])

countByValue = Map[T, Long]

rdd1.countByValue()
Map(sr3 -> 1, sr -> 1, sr4 -> 1, sr2 -> 1)

dependencies = Seq[spark.Dependency[_]]

glom = RDD[Array[T]]
Returns number of Arrays as many as number of partitions, for each Partition, and with elements in it as Array

group By = RDD[(K, Seq[T])]

rdd1.groupBy(x => x.length)

result:
(2,CompactBuffer(sr))
(3,CompactBuffer(sr2, sr3, sr4))
 


key By = RDD[(K, T)]

rdd1.keyBy(x => x.length)

(2,sr)
(3,sr2)
(3,sr3)
(3,sr4)


paritioner = Option[Partitioner]

partitions = Array[Partition]

zip = RDD[(T, U)] - RDDs with same number of elements can only be zipped to gether. 

rdd1.zip(rdd2)

(sr,cs1)
(sr2,cs2)
(sr3,cs3)
(sr4,cs4)



Pairs RDD Functions and thier Return Types: 

 
p1 = {(1,sr1),(1,sri1),(2,sr2),(3,sr3),(5,sr5)}
p2 = {(2,cs2),(4,cs4),(1,cs1)}

ip1 = {(1,1),(2,2),(3,3),(1,5),(2,6),(3,7)}
ip2 = {(1,6),(2,7),(3,8),(4,9),(1,11)}
 

cogroup : RDD[(K, (Seq[V], Seq[W]))]

p1.cogroup(p2)
(5,(CompactBuffer(sr5),CompactBuffer()))
(3,(CompactBuffer(sr3),CompactBuffer()))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(cs1)))
(4,(CompactBuffer(),CompactBuffer(cs4)))
(2,(CompactBuffer(sr2),CompactBuffer(cs2)))


p1.cogrou(ip1)

(2,(CompactBuffer(sr2),CompactBuffer(2, 6)))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(1, 5)))
(3,(CompactBuffer(sr3),CompactBuffer(3, 7)))
(5,(CompactBuffer(sr5),CompactBuffer()))

p1.cogroup(p2, ip1)
(4,(CompactBuffer(),CompactBuffer(cs4),CompactBuffer()))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(cs1),CompactBuffer(1, 5)))
(3,(CompactBuffer(sr3),CompactBuffer(),CompactBuffer(3, 7)))
(5,(CompactBuffer(sr5),CompactBuffer(),CompactBuffer()))
(2,(CompactBuffer(sr2),CompactBuffer(cs2),CompactBuffer(2, 6)))
 

groupByKey(): RDD[(K, Seq[V])]

(2,CompactBuffer(sr2))
(1,CompactBuffer(sr1, sri1))
(3,CompactBuffer(sr3))
(5,CompactBuffer(sr5))

groupWith - RDD[(K, (Seq[V], Seq[W]))]

p1.groupWith(ip1)
(2,(CompactBuffer(sr2),CompactBuffer(2, 6)))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(1, 5)))
(3,(CompactBuffer(sr3),CompactBuffer(3, 7)))
(5,(CompactBuffer(sr5),CompactBuffer()))


Join  - RDD[(K, (V, W))]

p1.join(p2)

(2,(sr2,cs2))
(1,(sr1,cs1))
(1,(sri1,cs1))

ip1.join(ip2)

(2,(2,7))
(2,(6,7))
(1,(1,6))
(1,(1,11))
(1,(5,6))
(1,(5,11))
(3,(3,8))
(3,(7,8))


Left Outer Join = RDD[(K, (V, Option[W]))]

ip1.leftOuterJoin(ip2)

(1,(1,Some(6)))
(1,(1,Some(11)))
(1,(5,Some(6)))
(1,(5,Some(11)))
(3,(3,Some(8)))
(3,(7,Some(8)))
(2,(2,Some(7)))
(2,(6,Some(7)))

 Right Outer Join = RDD[(K, (Option[V], W))
(1,(Some(1),6))
(1,(Some(1),11))
(1,(Some(5),6))
(1,(Some(5),11))
(3,(Some(3),8))
(3,(Some(7),8))
(4,(None,9))
(2,(Some(2),7))
(2,(Some(6),7))

Look Up = Seq[V]

ip1.lookup(1)
Seq[Int] = WrappedArray(1, 5)

ip1.lookup(4)
Seq[Int] = WrappedArray()

No comments:

Post a Comment