Thursday, 16 May 2019

Problem 1 -- Spark: find duplicate records for a field in an RDD

I have a data set like:

  10,"Name",2016,"Country"
  11,"Name1",2016,"country1"
  10,"Name",2016,"Country"
  10,"Name",2016,"Country"
  12,"Name2",2017,"Country2"

My problem statement is that I have to find the total record count and the duplicate count by year. The result should be (year, totalrecords, duplicates):

  2016,4,3
  2017,1,0

For 2016 the record 10,"Name",2016,"Country" appears three times, so all three occurrences count as duplicates.
I have tried to solve this problem as follows:

  val records = rdd.map { line =>
    val fields = line.split(",")
    (fields(2), line)                 // key each record by its year column
  }.groupByKey()

  val duplicates = records.map { case (year, rows) =>
    val totalCount = rows.size
    // find duplicates in the iterator: records that occur more than once
    val duplicateCount = rows.groupBy(identity).values
      .filter(_.size > 1).map(_.size).sum
    (year, totalCount, duplicateCount)
  }

It runs fine up to about 10 GB of data, but on larger inputs it takes a very long time. I found that groupByKey is not the best approach, since it shuffles every raw record across the network just to build the per-year groups.
Please suggest a better approach to this problem.
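
For reference, one common way to avoid groupByKey here is to pre-aggregate with reduceByKey, so counts are combined map-side on each partition before the shuffle instead of shipping every raw line. This is a minimal sketch under the assumptions above (comma-separated lines with the year in the third column, and duplicates counted as every occurrence of a record that appears more than once); the helper name yearStats is made up for illustration:

  import org.apache.spark.rdd.RDD

  // Hypothetical helper: (year, total records, duplicate records) per year.
  def yearStats(rdd: RDD[String]): RDD[(String, Long, Long)] = {
    // Step 1: count occurrences of each full record, keyed by (year, line).
    // reduceByKey combines partial counts on each partition, so only one
    // small tuple per distinct record crosses the network.
    val perRecord = rdd
      .map { line => ((line.split(",")(2), line), 1L) }
      .reduceByKey(_ + _)

    // Step 2: roll per-record counts up to per-year (total, duplicates).
    // A record contributes to the duplicate count only if it occurs
    // more than once.
    perRecord
      .map { case ((year, _), n) => (year, (n, if (n > 1) n else 0L)) }
      .reduceByKey { case ((t1, d1), (t2, d2)) => (t1 + t2, d1 + d2) }
      .map { case (year, (total, dups)) => (year, total, dups) }
  }

On the sample data this yields (2016,4,3) and (2017,1,0), and no year's records ever have to be held as a single in-memory group the way they are with groupByKey.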
