I have a data set like this:
10,"Name",2016,"Country"
11,"Name1",2016,"country1"
10,"Name",2016,"Country"
10,"Name",2016,"Country"
12,"Name2",2017,"Country2"
I need to find the total record count and the duplicate count per year. The result should be (year, totalrecords, duplicates):
2016,4,3
2017,1,0
I have tried to solve this problem with the following code:
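val records = rdd.map { x =>
  val array = x.split(",")
  (array(2), x) // key each record by its year (third field)
}.groupByKey()

val duplicates = records.map { x =>
  val totalcount = x._2.size
  // find duplicates in iterator
  // (assumed rule, inferred from the expected output: count every record that occurs more than once)
  val duplicates = x._2.groupBy(identity).values.map(_.size).filter(_ > 1).sum
  (x._1, totalcount, duplicates)
}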
It runs fine on up to 10 GB of data, but on larger inputs it takes a long time. I found that groupByKey is not the best approach, since it collects every record for a year in one place before anything is counted.
Please suggest a better approach to solve this problem.
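One possible direction, offered as a rough sketch rather than a definitive answer: replace groupByKey with two reduceByKey passes, so that counts are combined on each partition before the shuffle. The first pass counts how often each distinct record occurs within its year, and the second pass sums those counts per year. The object name YearDuplicates, the function countByYear, and the local[*] master are hypothetical names chosen for illustration. The sketch also assumes that a "duplicate" is any record occurring more than once, which is what the expected output (2016,4,3) implies, and that no field contains an embedded comma.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object YearDuplicates {

  // Returns (year, totalRecords, duplicateRecords) for each year.
  // Assumption: the year is the third comma-separated field, as in the question.
  def countByYear(lines: RDD[String]): RDD[(String, Long, Long)] =
    lines
      // Key by (year, whole record) so identical records share a key.
      .map(line => ((line.split(",")(2), line), 1L))
      // Occurrences of each distinct record; combined map-side before the shuffle.
      .reduceByKey(_ + _)
      // Per distinct record: contribute n to the total, and n to the
      // duplicates only if the record occurs more than once.
      .map { case ((year, _), n) => (year, (n, if (n > 1) n else 0L)) }
      // Sum totals and duplicates per year.
      .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))
      .map { case (year, (total, dups)) => (year, total, dups) }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("YearDuplicates").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sample = sc.parallelize(Seq(
      "10,\"Name\",2016,\"Country\"",
      "11,\"Name1\",2016,\"country1\"",
      "10,\"Name\",2016,\"Country\"",
      "10,\"Name\",2016,\"Country\"",
      "12,\"Name2\",2017,\"Country2\""
    ))
    // Prints (2016,4,3) and (2017,1,0)
    countByYear(sample).collect().sortBy(_._1).foreach(println)
    sc.stop()
  }
}
```

Note that the full record still travels through the first shuffle as part of the key. If the records are wide, keying by a hash of the record instead might reduce shuffle volume, at the cost of a small risk of hash collisions.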