Thursday, 16 May 2019

Problem 1 -- Spark: find duplicate records for a field in an RDD

I have a data set like:

  10,"Name",2016,"Country"
  11,"Name1",2016,"country1"
  10,"Name",2016,"Country"
  10,"Name",2016,"Country"
  12,"Name2",2017,"Country2"

My problem statement is that I have to find the total record count and the duplicate count by year. The result should be (year, totalrecords, duplicates):

  2016,4,3
  2017,1,0

For 2016 the record 10,"Name",2016,"Country" appears three times, so all three occurrences count as duplicates.
I have tried to solve this problem as follows:

  val records = rdd.map { line =>
    val fields = line.split(",")
    (fields(2), line)                 // key each record by its year column
  }.groupByKey()

  val duplicates = records.map { case (year, rows) =>
    val totalCount = rows.size
    // find duplicates in the iterator: records that occur more than once
    val duplicateCount = rows.groupBy(identity).values
      .filter(_.size > 1).map(_.size).sum
    (year, totalCount, duplicateCount)
  }

It runs fine up to about 10 GB of data, but on larger inputs it takes a very long time. I found that groupByKey is not the best approach, since it shuffles every raw record across the network just to build the per-year groups.
Please suggest a better approach to this problem.
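
For reference, one common way to avoid groupByKey here is to pre-aggregate with reduceByKey, so counts are combined map-side on each partition before the shuffle instead of shipping every raw line. This is a minimal sketch under the assumptions above (comma-separated lines with the year in the third column, and duplicates counted as every occurrence of a record that appears more than once); the helper name yearStats is made up for illustration:

  import org.apache.spark.rdd.RDD

  // Hypothetical helper: (year, total records, duplicate records) per year.
  def yearStats(rdd: RDD[String]): RDD[(String, Long, Long)] = {
    // Step 1: count occurrences of each full record, keyed by (year, line).
    // reduceByKey combines partial counts on each partition, so only one
    // small tuple per distinct record crosses the network.
    val perRecord = rdd
      .map { line => ((line.split(",")(2), line), 1L) }
      .reduceByKey(_ + _)

    // Step 2: roll per-record counts up to per-year (total, duplicates).
    // A record contributes to the duplicate count only if it occurs
    // more than once.
    perRecord
      .map { case ((year, _), n) => (year, (n, if (n > 1) n else 0L)) }
      .reduceByKey { case ((t1, d1), (t2, d2)) => (t1 + t2, d1 + d2) }
      .map { case (year, (total, dups)) => (year, total, dups) }
  }

On the sample data this yields (2016,4,3) and (2017,1,0), and no year's records ever have to be held as a single in-memory group the way they are with groupByKey.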
