I have collected some important RDD Functions and collected their return types for reference.
They are as follows:
Taking two Sequences as inputs:
rdd1 = {"sr","sr2","sr3","sr4"}
rdd2 = {"cs1","cs2","cs3","cs4"}
cartesian = RDD[(T, U)] - Pairs with each element against all the elements of other RDD
eg: rdd1.cartesian(rdd2)
result: RDD[(String, String)]
(sr,cs3)
(sr,cs4)
(sr,cs1)
(sr,cs2)
(sr2,cs1)
(sr2,cs2)
(sr2,cs3)
(sr2,cs4)
(sr3,cs1)
(sr3,cs2)
(sr4,cs1)
(sr4,cs2)
(sr3,cs3)
(sr3,cs4)
(sr4,cs3)
(sr4,cs4)
collect = Array[T]
eg: rdd1.collect()
result: Array(sr, sr2, sr3, sr4)
count = Long
rdd1.count()
Long = 4
countApprox = PartialResult[BoundedDouble]
rdd1.countApprox(1,0.95)
(final: [4.000, 4.000])
countByValue = Map[T, Long]
rdd1.countByValue()
Map(sr3 -> 1, sr -> 1, sr4 -> 1, sr2 -> 1)
dependencies = Seq[spark.Dependency[_]]
glom = RDD[Array[T]]
Returns number of Arrays as many as number of partitions, for each Partition, and with elements in it as Array
group By = RDD[(K, Seq[T])]
rdd1.groupBy(x => x.length)
result:
(2,CompactBuffer(sr))
(3,CompactBuffer(sr2, sr3, sr4))
key By = RDD[(K, T)]
rdd1.keyBy(x => x.length)
(2,sr)
(3,sr2)
(3,sr3)
(3,sr4)
paritioner = Option[Partitioner]
partitions = Array[Partition]
zip = RDD[(T, U)] - RDDs with same number of elements can only be zipped to gether.
rdd1.zip(rdd2)
(sr,cs1)
(sr2,cs2)
(sr3,cs3)
(sr4,cs4)
Pairs RDD Functions and thier Return Types:
p1 = {(1,sr1),(1,sri1),(2,sr2),(3,sr3),(5,sr5)}
p2 = {(2,cs2),(4,cs4),(1,cs1)}
ip1 = {(1,1),(2,2),(3,3),(1,5),(2,6),(3,7)}
ip2 = {(1,6),(2,7),(3,8),(4,9),(1,11)}
cogroup : RDD[(K, (Seq[V], Seq[W]))]
p1.cogroup(p2)
(5,(CompactBuffer(sr5),CompactBuffer()))
(3,(CompactBuffer(sr3),CompactBuffer()))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(cs1)))
(4,(CompactBuffer(),CompactBuffer(cs4)))
(2,(CompactBuffer(sr2),CompactBuffer(cs2)))
p1.cogrou(ip1)
(2,(CompactBuffer(sr2),CompactBuffer(2, 6)))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(1, 5)))
(3,(CompactBuffer(sr3),CompactBuffer(3, 7)))
(5,(CompactBuffer(sr5),CompactBuffer()))
p1.cogroup(p2, ip1)
(4,(CompactBuffer(),CompactBuffer(cs4),CompactBuffer()))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(cs1),CompactBuffer(1, 5)))
(3,(CompactBuffer(sr3),CompactBuffer(),CompactBuffer(3, 7)))
(5,(CompactBuffer(sr5),CompactBuffer(),CompactBuffer()))
(2,(CompactBuffer(sr2),CompactBuffer(cs2),CompactBuffer(2, 6)))
groupByKey(): RDD[(K, Seq[V])]
(2,CompactBuffer(sr2))
(1,CompactBuffer(sr1, sri1))
(3,CompactBuffer(sr3))
(5,CompactBuffer(sr5))
groupWith - RDD[(K, (Seq[V], Seq[W]))]
p1.groupWith(ip1)
(2,(CompactBuffer(sr2),CompactBuffer(2, 6)))
(1,(CompactBuffer(sr1, sri1),CompactBuffer(1, 5)))
(3,(CompactBuffer(sr3),CompactBuffer(3, 7)))
(5,(CompactBuffer(sr5),CompactBuffer()))
Join - RDD[(K, (V, W))]
p1.join(p2)
(2,(sr2,cs2))
(1,(sr1,cs1))
(1,(sri1,cs1))
ip1.join(ip2)
(2,(2,7))
(2,(6,7))
(1,(1,6))
(1,(1,11))
(1,(5,6))
(1,(5,11))
(3,(3,8))
(3,(7,8))
Left Outer Join = RDD[(K, (V, Option[W]))]
ip1.leftOuterJoin(ip2)
(1,(1,Some(6)))
(1,(1,Some(11)))
(1,(5,Some(6)))
(1,(5,Some(11)))
(3,(3,Some(8)))
(3,(7,Some(8)))
(2,(2,Some(7)))
(2,(6,Some(7)))
Right Outer Join = RDD[(K, (Option[V], W))
(1,(Some(1),6))
(1,(Some(1),11))
(1,(Some(5),6))
(1,(Some(5),11))
(3,(Some(3),8))
(3,(Some(7),8))
(4,(None,9))
(2,(Some(2),7))
(2,(Some(6),7))
Look Up = Seq[V]
ip1.lookup(1)
Seq[Int] = WrappedArray(1, 5)
ip1.lookup(4)
Seq[Int] = WrappedArray()