Fold in Spark
Fold is a powerful operation in Spark that lets you calculate many important values in O(n) time. If you are familiar with Scala collections, it will feel just like using fold on a collection. Even if you have never used fold in Scala, this post will make you comfortable using it.
Syntax
def fold(zeroValue: T)(op: (T, T) => T): T
The above is a high-level view of the fold API. It has the following three parts:
- T is the element type of the RDD.
- zeroValue is the initial accumulator of type T; the final accumulated value, also of type T, is the return value of fold.
- op is a function that is called with the accumulator and each element of the RDD, and is also used to merge the per-partition accumulators (see the sketch after this list).
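Before looking at real data, here is a minimal sketch of the API, summing an RDD of doubles (assuming a SparkContext named sc is available, as it is in spark-shell; the variable names are illustrative):

val numbers = sc.makeRDD(List(1.0, 2.0, 3.0, 4.0))
// 0.0 is the zero value; op adds each element to the running accumulator
val sum = numbers.fold(0.0)((acc, value) => acc + value)
// sum: Double = 10.0

One caveat worth knowing: Spark applies the zero value once in every partition and once more when merging the partition results, so it should be an identity element for op (0.0 for addition here).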
Let’s see some examples of fold.
Finding the max in a given RDD
Let’s first build an RDD.
Example:
scala> import org.apache.spark._
import org.apache.spark._
scala> val employeeData = List(("Jack",1000.0),("Bob",20000.0),("Carl",7000.0))
employeeData: List[(String, Double)] = List((Jack,1000.0), (Bob,20000.0), (Carl,7000.0))
scala> val employeeRDD = sc.makeRDD(employeeData)
employeeRDD: org.apache.spark.rdd.RDD[(String, Double)] = ParallelCollectionRDD[0] at makeRDD at <console>:32
scala> val dummyEmployee = ("dummy", 0.0)
dummyEmployee: (String, Double) = (dummy,0.0)
scala> val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => {
| if(acc._2 < employee._2) employee else acc})
maxSalaryEmployee: (String, Double) = (Bob,20000.0)
scala> println("employee with maximum salary is"+maxSalaryEmployee)
employee with maximum salary is(Bob,20000.0)
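Note why the dummy employee is a safe zero value: fold feeds it into every partition and into the final merge, so it must never win the comparison, and a salary of 0.0 cannot beat any real (non-negative) salary. As a second sketch (reusing employeeRDD from above; the "total" label is illustrative), fold can also compute the total salary, with 0.0 as the identity for addition:

val totalSalary = employeeRDD.fold(("total", 0.0))(
  (acc, employee) => ("total", acc._2 + employee._2))
// totalSalary: (String, Double) = (total,28000.0)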