Fold in Spark
Fold is a powerful operation in Spark that lets you calculate many important values in O(n) time. If you are familiar with Scala collections, it will feel just like using fold on a collection. Even if you have never used fold in Scala, this post will make you comfortable using it.
Syntax
def fold(zeroValue: T)(op: (T, T) => T): T
The above is a high-level view of the fold API. It has the following three parts:
- T is the element type of the RDD.
- zeroValue is the initial accumulator of type T; the final accumulated value, also of type T, is the return value of fold.
- op is a function that is called with the accumulator and each element of the RDD, and is also used to merge the per-partition accumulators (see the sketch after this list).
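Before looking at real data, here is a minimal sketch of the API, summing an RDD of doubles (assuming a SparkContext named sc is available, as it is in spark-shell; the variable names are illustrative):

val numbers = sc.makeRDD(List(1.0, 2.0, 3.0, 4.0))
// 0.0 is the zero value; op adds each element to the running accumulator
val sum = numbers.fold(0.0)((acc, value) => acc + value)
// sum: Double = 10.0

One caveat worth knowing: Spark applies the zero value once in every partition and once more when merging the partition results, so it should be an identity element for op (0.0 for addition here).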
Let’s see some examples of fold.
Finding the max in a given RDD
Let’s first build an RDD.
Example:
scala> import org.apache.spark._
import org.apache.spark._
scala> val employeeData = List(("Jack",1000.0),("Bob",20000.0),("Carl",7000.0))
employeeData: List[(String, Double)] = List((Jack,1000.0), (Bob,20000.0), (Carl,7000.0))
scala> val employeeRDD = sc.makeRDD(employeeData)
employeeRDD: org.apache.spark.rdd.RDD[(String, Double)] = ParallelCollectionRDD[0] at makeRDD at <console>:32
scala> val dummyEmployee = ("dummy", 0.0)
dummyEmployee: (String, Double) = (dummy,0.0)
scala> val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => {
| if(acc._2 < employee._2) employee else acc})
maxSalaryEmployee: (String, Double) = (Bob,20000.0)
scala> println("employee with maximum salary is"+maxSalaryEmployee)
employee with maximum salary is(Bob,20000.0)
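Note why the dummy employee is a safe zero value: fold feeds it into every partition and into the final merge, so it must never win the comparison, and a salary of 0.0 cannot beat any real (non-negative) salary. As a second sketch (reusing employeeRDD from above; the "total" label is illustrative), fold can also compute the total salary, with 0.0 as the identity for addition:

val totalSalary = employeeRDD.fold(("total", 0.0))(
  (acc, employee) => ("total", acc._2 + employee._2))
// totalSalary: (String, Double) = (total,28000.0)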