Tuesday, 24 September 2019

some other useful blogs for Spark


http://apachesparkbook.blogspot.com

How to connect impala through shell?

coming soon


How to connect to Impala?

impala-shell -k -i clustername:portno -d database

(-k uses Kerberos authentication, -i points to the impalad host:port, -d selects the default database)


How to execute a Hive query from the terminal?

hive -S -e "use database"
hive -S -e "query";


How to see the log when a job fails, or the log for a particular job?

Check the Oozie link for that particular job.


How to see the tables of a particular user?




Note: post your doubts in the comments so that I can help you.



Thursday, 16 May 2019

basic unix

sed '1d' filename

it will print the file without its first line (the file itself is unchanged unless you use sed -i).

sed '$d' filename

it will print the file without its last line.

cat filename

it will show the content of the file

cat -b filename

it will show the content of the file with line numbers (-b numbers only non-blank lines; use -n to number every line).

cat filename|grep -v 'word'

it will print the file, skipping the lines that contain the given word.

sed 's/apple//g' filename

it will remove every occurrence of the word apple from each line (output goes to stdout; the file is unchanged unless -i is used).

grep -r "word" *

it will recursively search for the word in all files under the current directory.


Note: post your doubts in the comments so that I can help you.





Problem 1 -- Spark: find duplicate records for a field in an RDD

I have a data set like:

10,"Name",2016,"Country"
11,"Name1",2016,"country1"
10,"Name",2016,"Country"
10,"Name",2016,"Country"
12,"Name2",2017,"Country2"

My problem statement is that I have to find the total count and the duplicate count by year. My result should be (year, totalrecords, duplicates):

2016,4,3
2017,1,0

I have tried to solve this problem with:

val records = rdd.map { x =>
  val array = x.split(",")
  (array(2), x)
}.groupByKey()

val duplicates = records.map { x =>
  val totalcount = x._2.size
  val duplicates = ??? // find duplicates in iterator
  (x._1, totalcount, duplicates)
}

It runs fine for up to about 10 GB of data, but on larger data it takes a very long time. I found that groupByKey is not the best approach.
Please suggest a better approach to solve this problem.
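One possible direction is to avoid grouping the raw rows at all and instead count each distinct record with reduceByKey, which does map-side combining. A sketch, assuming the same comma-separated layout as above:

// Count occurrences of each full record per year; only (key, count) pairs are shuffled.
val perRecord = rdd
  .map { line =>
    val year = line.split(",")(2)
    ((year, line), 1L)                    // key on (year, full record)
  }
  .reduceByKey(_ + _)                     // occurrences of each distinct record

// Roll the per-record counts up to (year, totalrecords, duplicates).
val result = perRecord
  .map { case ((year, _), cnt) =>
    (year, (cnt, if (cnt > 1) cnt else 0L))   // duplicates = all copies of a repeated record
  }
  .reduceByKey { case ((t1, d1), (t2, d2)) => (t1 + t2, d1 + d2) }
  .map { case (year, (total, dups)) => (year, total, dups) }

result.collect().foreach(println)         // e.g. (2016,4,3) and (2017,1,0)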

Friday, 10 May 2019

fold in Spark

Fold in spark

Fold is a very powerful operation in Spark that allows you to calculate many important values in O(n) time. If you are familiar with Scala collections, it is like using the fold operation on a collection. Even if you have not used fold in Scala, this post will make you comfortable with it.

Syntax

def fold(zeroValue: T)(op: (T, T) => T): T
The above is a high-level view of the fold API. It has the following three parts:
  1. T is the element type of the RDD.
  2. zeroValue is the initial accumulator of type T; it is used as the starting value in each partition, and the final accumulated value is the return value of the fold operation.
  3. op is a function that is called with the previous accumulator and each element of the RDD (and again to merge the per-partition accumulators).
Let’s see some examples of fold
Finding max in a given RDD
Let’s first build a RDD

example

scala> import org.apache.spark._
import org.apache.spark._

scala> val employeeData = List(("Jack",1000.0),("Bob",20000.0),("Carl",7000.0))
employeeData: List[(String, Double)] = List((Jack,1000.0), (Bob,20000.0), (Carl,7000.0))

scala>  val employeeRDD = sc.makeRDD(employeeData)
employeeRDD: org.apache.spark.rdd.RDD[(String, Double)] = ParallelCollectionRDD[0] at makeRDD at <console>:32

scala> val dummyEmployee = ("dummy",0.0);
dummyEmployee: (String, Double) = (dummy,0.0)

scala> val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => {
     | if(acc._2 < employee._2) employee else acc})
maxSalaryEmployee: (String, Double) = (Bob,20000.0)

scala> println("employee with maximum salary is"+maxSalaryEmployee)
employee with maximum salary is(Bob,20000.0)
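As one more small example, fold can also total the salaries; a sketch, using ("total", 0.0) as the zero value:

val totalSalary = employeeRDD.fold(("total", 0.0))((acc, employee) =>
  ("total", acc._2 + employee._2))
// totalSalary: (String, Double) = (total,28000.0)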

Monday, 6 May 2019

useful scenario in bigdata

1. How do you insert the alternate columns of a CSV file into Spark? (a sketch follows)
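A minimal sketch, assuming "alternate columns" means every other column (positions 0, 2, 4, ...) and a hypothetical input path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alt-cols").getOrCreate()
val df = spark.read.option("inferSchema", "true").csv("/path/to/input.csv")

// keep only the columns at even positions
val altCols = df.columns.zipWithIndex.collect { case (c, i) if i % 2 == 0 => c }
val altDF = df.select(altCols.map(df.col): _*)
altDF.show()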


2. See this scenario.
Input file:-

name^age^state
swathi^23^us
srivani^24^UK
ram^25^London
scala> case class schema(name:String,age:Int,brand_code:String)
scala> val rdd = sc.textFile("file://<file-path>/test1.csv")
scala> val rdd1= rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
scala> val df1 = rdd1.map(_.split("\\^")).map(x=> schema(x(0).toString,x(1).toInt,x(2).toString)).toDF()
(or)
scala> val df1 = rdd1.map(_.split('^')).map(x=> schema(x(0).toString,x(1).toInt,x(2).toString)).toDF()
Output:-

scala> df1.show()
+-------+---+----------+
|   name|age|brand_code|
+-------+---+----------+
| swathi| 23|        us|
|srivani| 24|        UK|
|    ram| 25|    London|
+-------+---+----------+


3. How will you add column names to a text file?

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType,StringType,StructField,StructType}


val schema =new StructType().add(StructField("name",StringType,true)).add(StructField("age",IntegerType,true)).add(StructField("state",StringType,true))


val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).toInt, x(2)))
val df = sqlContext.createDataFrame(rowsRDD, schema)

or

val df1 = sc.textFile("testfile.txt").map(_.split('|')).map(x => schema(x(0).toString, x(1).toInt, x(2).toString)).toDF()  // note: this reuses the schema case class from scenario 2, not the StructType above
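Another option (a sketch, assuming Spark 2.x with a SparkSession called spark) is to let the DataFrameReader apply the StructType defined above and skip the header itself:

val df2 = spark.read
  .schema(schema)               // the StructType built above
  .option("header", "true")     // drop the header row instead of filtering it manually
  .csv("/user/206571870/sample.csv")
df2.show()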





Saturday, 4 May 2019

useful simple scenario scala

rdd.zipWithIndex.filter(_._2==9).map(_._1).first()
zipWithIndex transforms the RDD into pairs (value, idx), with idx going from 0 onwards. The filter keeps the element with idx == 9 (the 10th), the map takes back the original value, and first() returns the result.
Note that zipWithIndex may itself trigger a Spark job (it needs the size of every partition), so it can influence the behaviour of the whole processing. Give it a try.
In any case, if n is very large, this method is efficient in that it does not require collecting an array of the first n elements on the driver node.


// yet to be modified

Friday, 3 May 2019

basic hadoop

Variations of the Hadoop ls Shell Command
$ hadoop fs -ls /
Returns all the available files and subdirectories present under the root directory.
$ hadoop fs -ls -R /user/cloudera
Recursively lists all the available files and subdirectories under /user/cloudera
4) mkdir - Used to create a new directory in HDFS at a given location.
Example of HDFS mkdir Command -
$ hadoop fs -mkdir /user/cloudera/sample
The above command will create a new directory named sample under the location /user/cloudera
Note: Cloudera and other Hadoop distribution vendors provide the /user/ directory with read/write permission to all users, but other directories are available as read-only. Thus, to create a folder in the root directory, users require superuser permission, as shown below -
$ sudo -u hdfs hadoop fs -mkdir /test
This command will create a new directory named test under / (the root directory).

5) copyFromLocal

Copies a file from the local filesystem to an HDFS location.
For the following examples, we will use the Sample text files available in the /home/cloudera location.
Example - $ hadoop fs -copyFromLocal Sample1.txt /user/cloudera/test
Copies/uploads Sample1.txt available in /home/cloudera (the local default) to /user/cloudera/test (HDFS path)

6) put –

This Hadoop command uploads one or more source files from the local file system to the Hadoop distributed file system (HDFS).
Example - $ hadoop fs -put Sample2.txt /user/cloudera/test
Copies/uploads Sample2.txt available in /home/cloudera (the local default) to /user/cloudera/test (HDFS path)

7) moveFromLocal 

This Hadoop command works like the put command, but the source file is deleted after copying.
Example - $ hadoop fs -moveFromLocal Sample3.txt /user/cloudera/dezyre1
Moves Sample3.txt available in /home/cloudera (the local default) to /user/cloudera/dezyre1 (HDFS path). The source file is deleted after moving.

8) du

Displays the disk usage of all the files available under a given directory.
Example - $ hadoop fs -du /user/cloudera/test

9) df

Displays the capacity, used space and free space of the Hadoop distributed file system.

Example - $ hadoop fs -df

10) expunge

This HDFS command empties the trash, permanently deleting the files and directories in it.
Example - $ hadoop fs -expunge

11) cat

This is similar to the cat command in Unix; it displays the contents of a file.
Example - $ hadoop fs -cat /user/cloudera/test/Sample1.txt

12) cp 

Copy files from one HDFS location to another HDFS location.
Example - $ hadoop fs -cp /user/cloudera/test/war_and_peace /user/cloudera/test/

13) mv 

Move files from one HDFS location to another HDFS location.
Example - $ hadoop fs -mv /user/cloudera/test/Sample1.txt /user/cloudera/tes/

14) rm

Removes a file or directory from the mentioned HDFS location.
Example - $ hadoop fs -rm /user/cloudera/test/Sample1.txt
Deletes or removes a file from the HDFS location.
Example - $ hadoop fs -rm -r /user/cloudera/test
Deletes or removes the directory and its content from the HDFS location recursively.

15) tail 

This hadoop command will show the last kilobyte of the file to stdout.
Example – $ hadoop fs -tail /user/cloudera/test/war_and_peace

16) copyToLocal

Copies files to the local filesystem. This is similar to the hadoop fs -get command, but here the destination must be a local file reference.
Example - $ hadoop fs -copyToLocal /user/cloudera/test/Sample1.txt /home/cloudera/hdfs_bkp/
Copies/downloads Sample1.txt available in /user/cloudera/test (HDFS path) to /home/cloudera/hdfs_bkp/ (local path)

17) get

Downloads or copies files to the local filesystem.
Example - $ hadoop fs -get /user/cloudera/test/Sample2.txt /home/cloudera/hdfs_bkp/
Copies/downloads Sample2.txt available in /user/cloudera/test (HDFS path) to /home/cloudera/hdfs_bkp/ (local path)

18) touchz

Used to create an empty file at the specified location.
Example - $ hadoop fs -touchz /user/cloudera/test/Sample4.txt
It will create a new empty file Sample4.txt in /user/cloudera/test/ (HDFS path)

19) setrep

This hadoop fs command is used to set the replication factor for a specific file.
Example - $ hadoop fs -setrep -w 1 /user/cloudera/test/Sample1.txt
It will set the replication factor of Sample1.txt to 1 (-w waits until the replication is complete).

20) chgrp

This Hadoop command is used to change the group of a file or directory.
Example - $ sudo -u hdfs hadoop fs -chgrp -R cloudera /test
It will change the group of the /test directory (recursively) from supergroup to cloudera (superuser permission is required for this operation)

21) chown

This command lets you change both the owner and the group name simultaneously.
Example - $ sudo -u hdfs hadoop fs -chown -R cloudera /test
It will change the ownership of the /test directory from the hdfs user to the cloudera user (superuser permission is required for this operation)

22) hadoop chmod

Used to change the permissions of a given file or directory.
Example - $ hadoop fs -chmod 700 /test
It will change the /test directory permissions to 700 (drwx------).






















Interview Preparation reminder

About your project
Spark architecture?
Hive queries, joins
Hive managed (internal) tables vs external tables
Type of data you have used
Sqoop commands
Different ways of creating a DataFrame?
Spark initialization
Difference between client mode and cluster mode?








basic sqoop

sqoop import \
  --connect jdbc:mysql://localhost:3000/databasename \
  --username sat --password pass \
  --table tablename \
  --incremental append \
  --check-column somecolumn \
  --last-value somevalue \
  --target-dir location

Monday, 29 April 2019

basic hive

1. Give the command to see the indexes on a table.
SHOW INDEX ON table_name;

This will list all the indexes created on any of the columns in the table table_name.


2. How do you specify the table creator name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table.

TBLPROPERTIES is added like:

TBLPROPERTIES ('creator' = 'Joan')


3. How do you specify the Hive transactional property?
TBLPROPERTIES ('transactional' = 'true')


4. How will you create a Hive table?
create table tablename(id int,name string,age int) row format delimited fields terminated by ',';

5. A simple scenario for date format conversion

create table mytable (dt_tm string);

insert into mytable values
    ('2/3/2017 23:37')
   ,('2/3/2017 23:37')
   ,('2/3/2017 23:40')
   ,('2/3/2017 23:50')
   ,('2/3/2017 23:51')
   ,('2/3/2017 23:53')
   ,('2/3/2017 23:55')
   ,('2/4/2017 0:08' )
   ,('2/4/2017 0:57' )
;


select  dt_tm
       ,cast(from_unixtime(unix_timestamp(dt_tm,'dd/MM/yyyy HH:mm'),'yyyy-MM-dd 00:00:00') as timestamp)

from    mytable
;


+----------------+---------------------+
| 2/3/2017 23:37 | 2017-03-02 00:00:00 |
| 2/3/2017 23:37 | 2017-03-02 00:00:00 |
| 2/3/2017 23:40 | 2017-03-02 00:00:00 |
| 2/3/2017 23:50 | 2017-03-02 00:00:00 |
| 2/3/2017 23:51 | 2017-03-02 00:00:00 |
| 2/3/2017 23:53 | 2017-03-02 00:00:00 |
| 2/3/2017 23:55 | 2017-03-02 00:00:00 |
| 2/4/2017 0:08  | 2017-04-02 00:00:00 |
| 2/4/2017 0:57  | 2017-04-02 00:00:00 |
+----------------+---------------------+





Friday, 26 April 2019

basic unix

Find the disk usage
du

Note: on HP-UX, the bdf command has to be used instead.


Find the exit status of a Unix command
Following the execution of a command or pipe, $? gives the exit status of the last command executed. After a script terminates, $? from the command line gives the exit status of the script, that is, of the last command executed in the script, which is, by convention, 0 on success or an integer in the range 1 - 255 on error.

command to find current working directory in unix
pwd 



Thursday, 25 April 2019

basic spark

Components

Following are some important components of Spark
  1. Cluster Manager
    1. Is used to run the Spark Application in Cluster Mode
  2. Application
    • User program built on Spark. Consists of,
    • Driver Program
      • The Program that has SparkContext. Acts as a coordinator for the Application
    • Executors
      • Runs computation & Stores Application Data
      • Are launched at the beginning of an Application and run for its entire lifetime
      • Each Application gets its own Executors
      • An Application can have multiple Executors
      • An Executor is not shared by multiple Applications
      • Provides in-memory storage for RDDs
      • In standalone mode, by default no more than one Executor per Application runs on the same node
  3. Task
    1. Represents a unit of work in Spark
    2. Gets executed in Executor
  4. Job 
    1. A parallel computation consisting of multiple Tasks that gets spawned in response to a Spark action.


SparkConf()
Configuration for Spark application

SparkContext()
Main entry point of Spark functionality.
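A minimal sketch of how the two fit together (assuming local mode purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example-app")
  .setMaster("local[*]")        // assumption: local mode, for illustration only
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 10)
println(rdd.sum())              // 55.0
sc.stop()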

Shuffling is the process of data transfer between stages.

Tip
Avoid shuffling at all cost. Think about ways to leverage existing partitions. Leverage partial aggregation to reduce data transfer.

By default, shuffling doesn’t change the number of partitions, but their content.

Avoid groupByKey and use reduceByKey or combineByKey instead.

groupByKey shuffles all the data, which is slow.

reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
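A small sketch contrasting the two on a word count (assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// groupByKey ships every (word, 1) pair across the network before counting
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines counts inside each partition first, so far less data is shuffled
val countsFast = pairs.reduceByKey(_ + _)

countsFast.collect().foreach(println)   // e.g. (a,3), (b,2), (c,1)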


The number of tasks in a stage is equal to the number of partitions; each task processes one partition.

An executor is not tied to a single task: it can run many tasks over its lifetime.

A stage is a series of transformations that can run without a shuffle.
A new stage gets created at every shuffle.

The NodeManager allocates the resources on its own node.

Driver and executors are logical (process-level) concepts.

HDFS blocks are physical storage units, while partitions are the chunks of data Spark processes (often one partition per block when reading from HDFS).

The ResourceManager accepts the submitted job and allocates resources across the overall cluster.

RDD, DataFrame and Dataset are distributed (partitioned) data abstractions, i.e. their data is split across the cluster, whereas plain Scala objects are not distributed.


There are two kinds of transformations in spark:


  1. Narrow transformations
  2. Wide transformations

Narrow transformations:
Narrow transformations are the result of operations such as map and filter, in which the data to be transformed
comes from a single partition only, i.e. it is self-contained.
An output RDD has partitions with records that originate from a
single partition in the parent RDD.

Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey.
The data required to compute the records in a single partition may
reside in many partitions of the parent RDD.

Wide transformations are also called shuffle transformations, as they may require a shuffle.
All of the tuples with the same key must end up in the same partition, processed by the same task.
To satisfy these operations, Spark must execute RDD shuffle, which transfers data across cluster
and results in a new stage with a new set of partitions.
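A quick sketch of both kinds (assuming an existing SparkContext sc):

val nums = sc.parallelize(1 to 100, 4)

// narrow: each output partition depends on exactly one input partition
val narrow = nums.map(_ * 2).filter(_ % 3 == 0)

// wide: reduceByKey must bring records with the same key into the same partition, so it shuffles
val wide = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)

println(wide.toDebugString)   // shows the new stage introduced by the shuffle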

---------------------------------------------------------------------------------------------------------------------

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:
  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default filesystem name.

----------------------------------------------------------------------------------------------------------------------------------------

Properties


spark.driver.memory
1g
Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g).
Note: In client mode, this config must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option
or in your default properties file.

spark.executor.memory
1g
Amount of memory to use per executor process (e.g. 2g, 8g).
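For example, a sketch of setting the executor memory from code (driver memory in client mode should instead be passed via --driver-memory, as noted above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-config-example")
  .config("spark.executor.memory", "2g")   // per-executor memory
  .getOrCreate()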



The lit function is used to add a literal value as a column.
import org.apache.spark.sql.functions._
df.withColumn("D", lit(750))
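A complete little sketch (assuming an active SparkSession named spark):

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.withColumn("D", lit(750)).show()
// +---+-----+---+
// |key|value|  D|
// +---+-----+---+
// |  a|    1|750|
// |  b|    2|750|
// +---+-----+---+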