Thursday, 16 May 2019

basic unix

sed '1d' filename

it will print the file's contents with the first line removed (use sed -i '1d' filename to edit the file in place).

sed '$d' filename

it will print the file's contents with the last line removed (use sed -i '$d' filename to edit the file in place).

cat filename

it will show the contents of the file.

cat -b filename

it will show the contents of the file with line numbers on the non-blank lines (use cat -n to number every line).

cat filename|grep -v 'word'

it will print only the lines that do not contain 'word' (grep -v inverts the match).

sed 's/apple//g' filename

it will remove every occurrence of the word apple from each line and print the result (use sed -i to modify the file in place).

grep -r "word" *

it will recursively search for 'word' in all files under the current directory.


Note: post your doubts in the comments so that I can help you.





Problem 1 -- Spark: find duplicate records for a field in an RDD

I have a data set like:
10,"Name",2016,"Country"
11,"Name1",2016,"country1"
10,"Name",2016,"Country"
10,"Name",2016,"Country"
12,"Name2",2017,"Country2"
My problem statement is that I have to find the total count and the duplicate count by year. My result should be (year, totalrecords, duplicates):
2016,4,3
2017,1,0
I have tried to solve this problem with:

val records = rdd.map { x =>
  val array = x.split(",")
  (array(2), x)
}.groupByKey()

val duplicates = records.map { x =>
  val totalcount = x._2.size
  val duplicates = // find duplicates in iterator
  (x._1, totalcount, duplicates)
}

It runs fine up to 10 GB of data, but on larger data it takes a long time. I found that groupByKey is not the best approach.
Please suggest a better approach to solve this problem.
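
One possible approach (a sketch only) is to replace groupByKey with reduceByKey, so the counts are combined on the map side instead of shuffling every full record. It assumes the same rdd of raw comma-separated lines as above, and it counts every copy of a record that occurs more than once as a duplicate, which is what the expected output implies.

val yearRecordCounts = rdd
  .map { line => ((line.split(",")(2), line), 1) }  // key = (year, full record)
  .reduceByKey(_ + _)                               // occurrences of each distinct record per year

val result = yearRecordCounts
  .map { case ((year, _), cnt) => (year, (cnt, if (cnt > 1) cnt else 0)) }
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2)) // (total records, duplicate records) per year
  .map { case (year, (total, dups)) => (year, total, dups) }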

Friday, 10 May 2019

fold in Spark


Fold is a very powerful operation in Spark that allows you to calculate many important values in O(n) time. If you are familiar with Scala collections, it works just like the fold operation on a collection. Even if you have not used fold in Scala, this post will make you comfortable using it.

Syntax

def fold(zeroValue: T)(op: (T, T) => T): T
The above is a high-level view of the fold API. It involves the following three things:
  1. T is the element type of the RDD.
  2. zeroValue is the initial accumulator of type T; fold also returns a value of type T.
  3. op is a function that is called for each element of the RDD together with the previous accumulator, and is also used to merge the per-partition results.
Let’s see some examples of fold
Finding max in a given RDD
Let’s first build a RDD

example

scala> import org.apache.spark._
import org.apache.spark._

scala> val employeeData = List(("Jack",1000.0),("Bob",20000.0),("Carl",7000.0))
employeeData: List[(String, Double)] = List((Jack,1000.0), (Bob,20000.0), (Carl,7000.0))

scala>  val employeeRDD = sc.makeRDD(employeeData)
employeeRDD: org.apache.spark.rdd.RDD[(String, Double)] = ParallelCollectionRDD[0] at makeRDD at <console>:32

scala> val dummyEmployee = ("dummy",0.0);
dummyEmployee: (String, Double) = (dummy,0.0)

scala> val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => {
     | if(acc._2 < employee._2) employee else acc})
maxSalaryEmployee: (String, Double) = (Bob,20000.0)

scala> println("employee with maximum salary is"+maxSalaryEmployee)
employee with maximum salary is(Bob,20000.0)
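
Fold can also be used for simple aggregations such as a salary total. Below is a minimal sketch against the same employeeRDD; note that the zero value must be an identity for the operation (0.0 for addition), because fold applies it once per partition and once more while merging the partition results.

// Sum all salaries with fold; the second field of the accumulator carries the running total.
val totalSalary = employeeRDD.fold(("total", 0.0)) {
  (acc, employee) => ("total", acc._2 + employee._2)
}
// totalSalary._2 should be 28000.0 for the sample data above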

Monday, 6 May 2019

useful scenario in bigdata

1. How do you load only the alternate columns of a CSV file into Spark? (see the sketch below)
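
One possible approach, a sketch only: split each line and keep the values at alternate positions (0, 2, 4, ...). The file name input.csv below is just an illustration, not a path from this post.

// Keep only the columns at even positions of every row.
val raw = sc.textFile("input.csv")
val alternateCols = raw.map { line =>
  line.split(",")
      .zipWithIndex
      .collect { case (value, idx) if idx % 2 == 0 => value }
      .mkString(",")
}
alternateCols.take(5).foreach(println)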


2. Consider this scenario:
Input file:-

name^age^state
swathi^23^us
srivani^24^UK
ram^25^London
scala> case class schema(name:String,age:Int,brand_code:String)
scala> val rdd = sc.textFile("file://<file-path>/test1.csv")
scala> val rdd1= rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
scala> val df1 = rdd1.map(_.split("\\^")).map(x=> schema(x(0).toString,x(1).toInt,x(2).toString)).toDF()
(or)
scala> val df1 = rdd1.map(_.split('^')).map(x=> schema(x(0).toString,x(1).toInt,x(2).toString)).toDF()
Output:-

scala> df1.show()
+-------+---+----------+
|   name|age|brand_code|
+-------+---+----------+
| swathi| 23|        us|
|srivani| 24|        UK|
|    ram| 25|    London|
+-------+---+----------+


3. How will you add column names when loading a text file?

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType,StringType,StructField,StructType}


val schema =new StructType().add(StructField("name",StringType,true)).add(StructField("age",IntegerType,true)).add(StructField("state",StringType,true))


val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0),x(1).toInt,x(2)))
val df = sqlContext.createDataFrame(rowsRDD,schema)

or

val df1 = sc.textFile("testfile.txt").map(_.split('|')).map(x => schema(x(0).toString, x(1).toInt, x(2).toString)).toDF()
(Note: this variant assumes a case class named schema, as in scenario 2 above, rather than the StructType defined earlier.)
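
Another option (a sketch, assuming the same pipe-delimited testfile.txt and a spark-shell session where the SQL implicits are already in scope) is to build tuples and pass the column names straight to toDF:

// Supply the column names directly instead of defining a case class or a StructType.
val df2 = sc.textFile("testfile.txt")
  .map(_.split('|'))
  .map(x => (x(0), x(1).toInt, x(2)))
  .toDF("name", "age", "state")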





Saturday, 4 May 2019

useful simple scenario scala

rdd.zipWithIndex.filter(_._2==9).map(_._1).first()
zipWithIndex transforms the RDD into pairs (value, idx), with idx starting at 0. The filter keeps the element with idx == 9 (the 10th element), map extracts the original value, and first() returns the result.
zipWithIndex may be pulled up by the execution engine and influence the behaviour of the whole processing, so give it a try on your data.
In any case, if n is very large, this method is efficient because it does not require collecting an array of the first n elements on the driver node.
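
For example, a small sketch that picks the 10th element of an RDD without collecting anything to the driver:

// parallelize 100..200 and take the element at index 9 (the 10th element, i.e. 109)
val numbers = sc.parallelize(100 to 200)
val tenth = numbers.zipWithIndex.filter(_._2 == 9).map(_._1).first()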


// yet to be modified

Friday, 3 May 2019

basic hadoop

Variations of Hadoop ls Shell Command
$ hadoop fs -ls /
Returns all the files and subdirectories present under the root directory.
$ hadoop fs -ls -R /user/cloudera
Recursively lists all the files and subdirectories under /user/cloudera
4) mkdir- Used to create a new directory in HDFS at a given location.
Example of HDFS mkdir Command -
$ hadoop fs -mkdir /user/cloudera/sample
The above command will create a new directory named sample under the location /user/cloudera
Note: Cloudera and other Hadoop distribution vendors provide the /user/ directory with read/write permission to all users, but other directories are available as read-only. Thus, to create a folder in the root directory, users require superuser permission, as shown below -
$ sudo -u hdfs hadoop fs -mkdir /test
This command will create a new directory named test under / (the root directory).

5) copyFromLocal

Copies a file from the local filesystem to an HDFS location.
For the following examples, we will use sample files available in the /home/cloudera location.
Example - $ hadoop fs -copyFromLocal Sample1.txt /user/cloudera/test
Copy/upload Sample1.txt available in /home/cloudera (local default) to /user/cloudera/test (HDFS path)

6) put –

This Hadoop command uploads a single file or multiple source files from the local file system to the Hadoop distributed file system (HDFS).
Example - $ hadoop fs -put Sample2.txt /user/cloudera/test
Copy/upload Sample2.txt available in /home/cloudera (local default) to /user/cloudera/test (HDFS path)

7) moveFromLocal 

This Hadoop command works like the put command, but the source file is deleted after copying.
Example - $ hadoop fs -moveFromLocal Sample3.txt /user/cloudera/dezyre1
Move Sample3.txt available in /home/cloudera (local default) to /user/cloudera/dezyre1 (HDFS path). The source file is deleted after moving.

8) du

Displays the disk usage for all the files available under a given directory.
Example - $ hadoop fs -du /user/cloudera/test

9) df

Displays the capacity, free space, and used space of the Hadoop distributed file system.

Example - $ hadoop fs -df

10) expunge

This HDFS command empties the trash, permanently deleting the files and directories in it.
Example - $ hadoop fs -expunge

11) cat

This is similar to the cat command in Unix and displays the contents of a file.
Example - $ hadoop fs -cat /user/cloudera/test/Sample1.txt

12) cp 

Copies files from one HDFS location to another HDFS location.
Example - $ hadoop fs -cp /user/cloudera/test/war_and_peace /user/cloudera/test2/

13) mv 

Moves files from one HDFS location to another HDFS location.
Example - $ hadoop fs -mv /user/cloudera/test/Sample1.txt /user/cloudera/test2/

14) rm

Removes a file or directory from the given HDFS location.
Example - $ hadoop fs -rm -r /user/cloudera/test
Deletes or removes the directory and its contents from the HDFS location recursively.
Example - $ hadoop fs -rm /user/cloudera/test/Sample1.txt
Deletes or removes a file from the HDFS location.

15) tail 

This Hadoop command writes the last kilobyte of the file to stdout.
Example - $ hadoop fs -tail /user/cloudera/test/war_and_peace

16) copyToLocal

Copies files from HDFS to the local filesystem. This is similar to the hadoop fs -get command, but here the destination must be a local file reference.
Example - $ hadoop fs -copyToLocal /user/cloudera/test/Sample1.txt /home/cloudera/hdfs_bkp/
Copy/download Sample1.txt available in /user/cloudera/test (HDFS path) to /home/cloudera/hdfs_bkp/ (local path)

17) get

Downloads or copies files from HDFS to the local filesystem.
Example - $ hadoop fs -get /user/cloudera/test/Sample2.txt /home/cloudera/hdfs_bkp/
Copy/download Sample2.txt available in /user/cloudera/test (HDFS path) to /home/cloudera/hdfs_bkp/ (local path)

18) touchz

Used to create an empty file at the specified location.
Example - $ hadoop fs -touchz /user/cloudera/test/Sample4.txt
It will create a new empty file Sample4.txt in /user/cloudera/test/ (HDFS path)

19) setrep

This hadoop fs command is used to set the replication factor for a specific file.
Example - $ hadoop fs -setrep -w 1 /user/cloudera/test/Sample1.txt
It will set the replication factor of Sample1.txt to 1 (-w waits until the replication is complete)

20) chgrp

This Hadoop command is used to change the group of a file or directory.
Example - $ sudo -u hdfs hadoop fs -chgrp -R cloudera /test
It will change the /test directory's group from supergroup to cloudera (superuser permission is required to perform this operation)

21) chown

This command lets you change both the owner and the group simultaneously.
Example - $ sudo -u hdfs hadoop fs -chown -R cloudera /test
It will change the /test directory's ownership from the hdfs user to the cloudera user (superuser permission is required to perform this operation)

22) chmod

Used to change the permissions of a given file or directory.
Example - $ hadoop fs -chmod 700 /test
It will change the /test directory permissions to 700 (drwx------).






















Interview Preparation reminder

About your project
Spark architecture?
Hive queries, joins
Hive managed (internal) tables vs external tables
Type of data you have used
Sqoop commands
Different ways of creating a DataFrame?
Spark initiation
Difference between client mode and cluster mode?








basic sqoop

sqoop import \
--connect jdbc:mysql://localhost:3000/<dbname> \
--username sat --password pass \
--table tablename \
--target-dir location \
--incremental append \
--check-column <column> \
--last-value somevalue