Tuesday, 24 September 2019

some other useful blogs for Spark


http://apachesparkbook.blogspot.com

How to connect impala through shell?

coming soon


How to connect to Impala?

impala-shell -k -i clustername:portno -d database

(-k uses Kerberos authentication, -i points to the impalad host:port, -d selects the default database)


How to execute a Hive query from the terminal?

hive -S -e "use database"
hive -S -e "query";


How to see the log when a job fails, or the log for a particular job?

Check the Oozie link for that particular job.


How to see the tables of a particular user?




Note: post your doubts in the comments so that I can help you.



Thursday, 16 May 2019

basic unix

sed '1d' filename

it will print the file without its first line (the file itself is unchanged unless you use sed -i).

sed '$d' filename

it will print the file without its last line.

cat filename

it will show the content of the file

cat -b filename

it will show the content of the file with line numbers (-b numbers only non-blank lines; use -n to number every line).

cat filename|grep -v 'word'

it will print the file, skipping the lines that contain the given word.

sed 's/apple//g' filename

it will remove every occurrence of the word apple from each line (output goes to stdout; the file is unchanged unless -i is used).

grep -r "word" *

it will recursively search for the word in all files under the current directory.


Note: post your doubts in the comments so that I can help you.





Problem 1 -- Spark: find duplicate records for a field in an RDD

I have a data set like:

10,"Name",2016,"Country"
11,"Name1",2016,"country1"
10,"Name",2016,"Country"
10,"Name",2016,"Country"
12,"Name2",2017,"Country2"

My problem statement is that I have to find the total count and the duplicate count by year. My result should be (year, totalrecords, duplicates):

2016,4,3
2017,1,0

I have tried to solve this problem with:

val records = rdd.map { x =>
  val array = x.split(",")
  (array(2), x)
}.groupByKey()

val duplicates = records.map { x =>
  val totalcount = x._2.size
  val duplicates = ??? // find duplicates in iterator
  (x._1, totalcount, duplicates)
}

It runs fine for up to about 10 GB of data, but on larger data it takes a very long time. I found that groupByKey is not the best approach.
Please suggest a better approach to solve this problem.
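One possible direction is to avoid grouping the raw rows at all and instead count each distinct record with reduceByKey, which does map-side combining. A sketch, assuming the same comma-separated layout as above:

// Count occurrences of each full record per year; only (key, count) pairs are shuffled.
val perRecord = rdd
  .map { line =>
    val year = line.split(",")(2)
    ((year, line), 1L)                    // key on (year, full record)
  }
  .reduceByKey(_ + _)                     // occurrences of each distinct record

// Roll the per-record counts up to (year, totalrecords, duplicates).
val result = perRecord
  .map { case ((year, _), cnt) =>
    (year, (cnt, if (cnt > 1) cnt else 0L))   // duplicates = all copies of a repeated record
  }
  .reduceByKey { case ((t1, d1), (t2, d2)) => (t1 + t2, d1 + d2) }
  .map { case (year, (total, dups)) => (year, total, dups) }

result.collect().foreach(println)         // e.g. (2016,4,3) and (2017,1,0)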

Friday, 10 May 2019

fold in Spark

Fold in spark

Fold is a very powerful operation in Spark that allows you to calculate many important values in O(n) time. If you are familiar with Scala collections, it is like using the fold operation on a collection. Even if you have not used fold in Scala, this post will make you comfortable with it.

Syntax

def fold(zeroValue: T)(op: (T, T) => T): T
The above is a high-level view of the fold API. It has the following three parts:
  1. T is the element type of the RDD.
  2. zeroValue is the initial accumulator of type T; it is used as the starting value in each partition, and the final accumulated value is the return value of the fold operation.
  3. op is a function that is called with the previous accumulator and each element of the RDD (and again to merge the per-partition accumulators).
Let’s see some examples of fold
Finding max in a given RDD
Let’s first build a RDD

example

scala> import org.apache.spark._
import org.apache.spark._

scala> val employeeData = List(("Jack",1000.0),("Bob",20000.0),("Carl",7000.0))
employeeData: List[(String, Double)] = List((Jack,1000.0), (Bob,20000.0), (Carl,7000.0))

scala>  val employeeRDD = sc.makeRDD(employeeData)
employeeRDD: org.apache.spark.rdd.RDD[(String, Double)] = ParallelCollectionRDD[0] at makeRDD at <console>:32

scala> val dummyEmployee = ("dummy",0.0);
dummyEmployee: (String, Double) = (dummy,0.0)

scala> val maxSalaryEmployee = employeeRDD.fold(dummyEmployee)((acc,employee) => {
     | if(acc._2 < employee._2) employee else acc})
maxSalaryEmployee: (String, Double) = (Bob,20000.0)

scala> println("employee with maximum salary is"+maxSalaryEmployee)
employee with maximum salary is(Bob,20000.0)
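As one more small example, fold can also total the salaries; a sketch, using ("total", 0.0) as the zero value:

val totalSalary = employeeRDD.fold(("total", 0.0))((acc, employee) =>
  ("total", acc._2 + employee._2))
// totalSalary: (String, Double) = (total,28000.0)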

Monday, 6 May 2019

useful scenario in bigdata

1. How do you insert the alternate columns of a CSV file into Spark? (a sketch follows)
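A minimal sketch, assuming "alternate columns" means every other column (positions 0, 2, 4, ...) and a hypothetical input path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("alt-cols").getOrCreate()
val df = spark.read.option("inferSchema", "true").csv("/path/to/input.csv")

// keep only the columns at even positions
val altCols = df.columns.zipWithIndex.collect { case (c, i) if i % 2 == 0 => c }
val altDF = df.select(altCols.map(df.col): _*)
altDF.show()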


2. See this scenario.
Input file:-

name^age^state
swathi^23^us
srivani^24^UK
ram^25^London
scala> case class schema(name:String,age:Int,brand_code:String)
scala> val rdd = sc.textFile("file://<file-path>/test1.csv")
scala> val rdd1= rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
scala> val df1 = rdd1.map(_.split("\\^")).map(x=> schema(x(0).toString,x(1).toInt,x(2).toString)).toDF()
(or)
scala> val df1 = rdd1.map(_.split('^')).map(x=> schema(x(0).toString,x(1).toInt,x(2).toString)).toDF()
Output:-

scala> df1.show()
+-------+---+----------+
|   name|age|brand_code|
+-------+---+----------+
| swathi| 23|        us|
|srivani| 24|        UK|
|    ram| 25|    London|
+-------+---+----------+


3. How will you add column names to a text file?

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType,StringType,StructField,StructType}


val schema =new StructType().add(StructField("name",StringType,true)).add(StructField("age",IntegerType,true)).add(StructField("state",StringType,true))


val data = sc.textFile("/user/206571870/sample.csv")
val header = data.first()
val rdd = data.filter(row => row != header)
val rowsRDD = rdd.map(x => x.split(",")).map(x => Row(x(0), x(1).toInt, x(2)))
val df = sqlContext.createDataFrame(rowsRDD, schema)

or

val df1 = sc.textFile("testfile.txt").map(_.split('|')).map(x => schema(x(0).toString, x(1).toInt, x(2).toString)).toDF()  // note: this reuses the schema case class from scenario 2, not the StructType above
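Another option (a sketch, assuming Spark 2.x with a SparkSession called spark) is to let the DataFrameReader apply the StructType defined above and skip the header itself:

val df2 = spark.read
  .schema(schema)               // the StructType built above
  .option("header", "true")     // drop the header row instead of filtering it manually
  .csv("/user/206571870/sample.csv")
df2.show()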





Saturday, 4 May 2019

useful simple scenario scala

rdd.zipWithIndex.filter(_._2==9).map(_._1).first()
zipWithIndex transforms the RDD into pairs (value, idx), with idx going from 0 onwards. The filter keeps the element with idx == 9 (the 10th), the map takes back the original value, and first() returns the result.
Note that zipWithIndex may itself trigger a Spark job (it needs the size of every partition), so it can influence the behaviour of the whole processing. Give it a try.
In any case, if n is very large, this method is efficient in that it does not require collecting an array of the first n elements on the driver node.


// yet to be modified

Friday, 3 May 2019

basic hadoop

Variations of the Hadoop ls Shell Command
$ hadoop fs -ls /
Returns all the available files and subdirectories present under the root directory.
$ hadoop fs -ls -R /user/cloudera
Recursively lists all the available files and subdirectories under /user/cloudera
4) mkdir - Used to create a new directory in HDFS at a given location.
Example of HDFS mkdir Command -
$ hadoop fs -mkdir /user/cloudera/sample
The above command will create a new directory named sample under the location /user/cloudera
Note: Cloudera and other Hadoop distribution vendors provide the /user/ directory with read/write permission to all users, but other directories are available as read-only. Thus, to create a folder in the root directory, users require superuser permission, as shown below -
$ sudo -u hdfs hadoop fs -mkdir /test
This command will create a new directory named test under / (the root directory).

5) copyFromLocal

Copies a file from the local filesystem to an HDFS location.
For the following examples, we will use the Sample text files available in the /home/cloudera location.
Example - $ hadoop fs -copyFromLocal Sample1.txt /user/cloudera/test
Copies/uploads Sample1.txt available in /home/cloudera (the local default) to /user/cloudera/test (HDFS path)

6) put –

This Hadoop command uploads one or more source files from the local file system to the Hadoop distributed file system (HDFS).
Example - $ hadoop fs -put Sample2.txt /user/cloudera/test
Copies/uploads Sample2.txt available in /home/cloudera (the local default) to /user/cloudera/test (HDFS path)

7) moveFromLocal 

This Hadoop command works like the put command, but the source file is deleted after copying.
Example - $ hadoop fs -moveFromLocal Sample3.txt /user/cloudera/dezyre1
Moves Sample3.txt available in /home/cloudera (the local default) to /user/cloudera/dezyre1 (HDFS path). The source file is deleted after moving.

8) du

Displays the disk usage of all the files available under a given directory.
Example - $ hadoop fs -du /user/cloudera/test

9) df

Displays the capacity, used space and free space of the Hadoop distributed file system.

Example - $ hadoop fs -df

10) expunge

This HDFS command empties the trash, permanently deleting the files and directories in it.
Example - $ hadoop fs -expunge

11) cat

This is similar to the cat command in Unix; it displays the contents of a file.
Example - $ hadoop fs -cat /user/cloudera/test/Sample1.txt

12) cp 

Copy files from one HDFS location to another HDFS location.
Example - $ hadoop fs -cp /user/cloudera/test/war_and_peace /user/cloudera/test/

13) mv 

Move files from one HDFS location to another HDFS location.
Example - $ hadoop fs -mv /user/cloudera/test/Sample1.txt /user/cloudera/tes/

14) rm

Removes a file or directory from the mentioned HDFS location.
Example - $ hadoop fs -rm /user/cloudera/test/Sample1.txt
Deletes or removes a file from the HDFS location.
Example - $ hadoop fs -rm -r /user/cloudera/test
Deletes or removes the directory and its content from the HDFS location recursively.

15) tail 

This hadoop command will show the last kilobyte of the file to stdout.
Example – $ hadoop fs -tail /user/cloudera/test/war_and_peace

16) copyToLocal

Copies files to the local filesystem. This is similar to the hadoop fs -get command, but here the destination must be a local file reference.
Example - $ hadoop fs -copyToLocal /user/cloudera/test/Sample1.txt /home/cloudera/hdfs_bkp/
Copies/downloads Sample1.txt available in /user/cloudera/test (HDFS path) to /home/cloudera/hdfs_bkp/ (local path)

17) get

Downloads or copies files to the local filesystem.
Example - $ hadoop fs -get /user/cloudera/test/Sample2.txt /home/cloudera/hdfs_bkp/
Copies/downloads Sample2.txt available in /user/cloudera/test (HDFS path) to /home/cloudera/hdfs_bkp/ (local path)

18) touchz

Used to create an empty file at the specified location.
Example - $ hadoop fs -touchz /user/cloudera/test/Sample4.txt
It will create a new empty file Sample4.txt in /user/cloudera/test/ (HDFS path)

19) setrep

This hadoop fs command is used to set the replication factor for a specific file.
Example - $ hadoop fs -setrep -w 1 /user/cloudera/test/Sample1.txt
It will set the replication factor of Sample1.txt to 1 (-w waits until the replication is complete).

20) chgrp

This Hadoop command is used to change the group of a file or directory.
Example - $ sudo -u hdfs hadoop fs -chgrp -R cloudera /test
It will change the group of the /test directory (recursively) from supergroup to cloudera (superuser permission is required for this operation)

21) chown

This command lets you change both the owner and the group name simultaneously.
Example - $ sudo -u hdfs hadoop fs -chown -R cloudera /test
It will change the ownership of the /test directory from the hdfs user to the cloudera user (superuser permission is required for this operation)

22) hadoop chmod

Used to change the permissions of a given file or directory.
Example - $ hadoop fs -chmod 700 /test
It will change the /test directory permissions to 700 (drwx------).






















Interview Preparation reminder

About your project
Spark architecture?
Hive queries, joins
Hive managed (internal) tables vs external tables
Type of data you have used
Sqoop commands
Different ways of creating a DataFrame?
Spark initialization
Difference between client mode and cluster mode?








basic sqoop

sqoop import \
  --connect jdbc:mysql://localhost:3000/databasename \
  --username sat --password pass \
  --table tablename \
  --incremental append \
  --check-column somecolumn \
  --last-value somevalue \
  --target-dir location

Monday, 29 April 2019

basic hive

1. Give the command to see the indexes on a table.
SHOW INDEX ON table_name;

This will list all the indexes created on any of the columns in the table table_name.


2. How do you specify the table creator name when creating a table in Hive?
The TBLPROPERTIES clause is used to add the creator name while creating a table.

TBLPROPERTIES is added like:

TBLPROPERTIES ('creator' = 'Joan')


3. How do you specify the Hive transactional property?
TBLPROPERTIES ('transactional' = 'true')


4. How will you create a Hive table?
create table tablename(id int,name string,age int) row format delimited fields terminated by ',';

5. A simple scenario for date format conversion

create table mytable (dt_tm string);

insert into mytable values
    ('2/3/2017 23:37')
   ,('2/3/2017 23:37')
   ,('2/3/2017 23:40')
   ,('2/3/2017 23:50')
   ,('2/3/2017 23:51')
   ,('2/3/2017 23:53')
   ,('2/3/2017 23:55')
   ,('2/4/2017 0:08' )
   ,('2/4/2017 0:57' )
;


select  dt_tm
       ,cast(from_unixtime(unix_timestamp(dt_tm,'dd/MM/yyyy HH:mm'),'yyyy-MM-dd 00:00:00') as timestamp)

from    mytable
;


+----------------+---------------------+
| 2/3/2017 23:37 | 2017-03-02 00:00:00 |
| 2/3/2017 23:37 | 2017-03-02 00:00:00 |
| 2/3/2017 23:40 | 2017-03-02 00:00:00 |
| 2/3/2017 23:50 | 2017-03-02 00:00:00 |
| 2/3/2017 23:51 | 2017-03-02 00:00:00 |
| 2/3/2017 23:53 | 2017-03-02 00:00:00 |
| 2/3/2017 23:55 | 2017-03-02 00:00:00 |
| 2/4/2017 0:08  | 2017-04-02 00:00:00 |
| 2/4/2017 0:57  | 2017-04-02 00:00:00 |
+----------------+---------------------+





Friday, 26 April 2019

basic unix

Find the disk usage
du

Note: on HP-UX, the bdf command has to be used instead.


Find the exit status of a Unix command
Following the execution of a command or pipe, $? gives the exit status of the last command executed. After a script terminates, $? from the command line gives the exit status of the script, that is, of the last command executed in the script, which is, by convention, 0 on success or an integer in the range 1 - 255 on error.

command to find current working directory in unix
pwd 



Thursday, 25 April 2019

basic spark

Components

Following are some important components of Spark
  1. Cluster Manager
    1. Is used to run the Spark Application in Cluster Mode
  2. Application
    • User program built on Spark. Consists of,
    • Driver Program
      • The Program that has SparkContext. Acts as a coordinator for the Application
    • Executors
      • Runs computation & Stores Application Data
      • Are launched at the beginning of an Application and run for its entire lifetime
      • Each Application gets its own Executors
      • An Application can have multiple Executors
      • An Executor is not shared by multiple Applications
      • Provides in-memory storage for RDDs
      • In standalone mode, by default no more than one Executor per Application runs on the same node
  3. Task
    1. Represents a unit of work in Spark
    2. Gets executed in Executor
  4. Job 
    1. A parallel computation consisting of multiple Tasks that gets spawned in response to a Spark action.


SparkConf()
Configuration for Spark application

SparkContext()
Main entry point of Spark functionality.
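A minimal sketch of how the two fit together (assuming local mode purely for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example-app")
  .setMaster("local[*]")        // assumption: local mode, for illustration only
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 10)
println(rdd.sum())              // 55.0
sc.stop()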

Shuffling is the process of data transfer between stages.

Tip
Avoid shuffling at all cost. Think about ways to leverage existing partitions. Leverage partial aggregation to reduce data transfer.

By default, shuffling doesn’t change the number of partitions, but their content.

Avoid groupByKey and use reduceByKey or combineByKey instead.

groupByKey shuffles all the data, which is slow.

reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
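A small sketch contrasting the two on a word count (assuming an existing SparkContext sc):

val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// groupByKey ships every (word, 1) pair across the network before counting
val countsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines counts inside each partition first, so far less data is shuffled
val countsFast = pairs.reduceByKey(_ + _)

countsFast.collect().foreach(println)   // e.g. (a,3), (b,2), (c,1)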


The number of tasks in a stage is equal to the number of partitions; each task processes one partition.

An executor is not tied to a single task: it can run many tasks over its lifetime.

A stage is a series of transformations that can run without a shuffle.
A new stage gets created at every shuffle.

The NodeManager allocates the resources on its own node.

Driver and executors are logical (process-level) concepts.

HDFS blocks are physical storage units, while partitions are the chunks of data Spark processes (often one partition per block when reading from HDFS).

The ResourceManager accepts the submitted job and allocates resources across the overall cluster.

RDD, DataFrame and Dataset are distributed (partitioned) data abstractions, i.e. their data is split across the cluster, whereas plain Scala objects are not distributed.


There are two kinds of transformations in spark:


  1. Narrow transformations
  2. Wide transformations

Narrow transformations:
Narrow transformations are the result of operations such as map and filter, in which the data to be transformed
comes from a single partition only, i.e. it is self-contained.
An output RDD has partitions with records that originate from a
single partition in the parent RDD.

Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey.
The data required to compute the records in a single partition may
reside in many partitions of the parent RDD.

Wide transformations are also called shuffle transformations, as they may require a shuffle.
All of the tuples with the same key must end up in the same partition, processed by the same task.
To satisfy these operations, Spark must execute RDD shuffle, which transfers data across cluster
and results in a new stage with a new set of partitions.
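A quick sketch of both kinds (assuming an existing SparkContext sc):

val nums = sc.parallelize(1 to 100, 4)

// narrow: each output partition depends on exactly one input partition
val narrow = nums.map(_ * 2).filter(_ % 3 == 0)

// wide: reduceByKey must bring records with the same key into the same partition, so it shuffles
val wide = narrow.map(n => (n % 10, n)).reduceByKey(_ + _)

println(wide.toDebugString)   // shows the new stage introduced by the shuffle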

---------------------------------------------------------------------------------------------------------------------

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:
  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default filesystem name.

----------------------------------------------------------------------------------------------------------------------------------------

Properties


spark.driver.memory
1g
Amount of memory to use for the driver process, i.e. where SparkContext is initialized. (e.g. 1g, 2g).
Note: In client mode, this config must not be set through the SparkConf directly in your application,
because the driver JVM has already started at that point. Instead, please set this through the --driver-memory command line option
or in your default properties file.

spark.executor.memory
1g
Amount of memory to use per executor process (e.g. 2g, 8g).
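For example, a sketch of setting the executor memory from code (driver memory in client mode should instead be passed via --driver-memory, as noted above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-config-example")
  .config("spark.executor.memory", "2g")   // per-executor memory
  .getOrCreate()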



The lit function is used to add a literal value as a column.
import org.apache.spark.sql.functions._
df.withColumn("D", lit(750))
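A complete little sketch (assuming an active SparkSession named spark):

import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
df.withColumn("D", lit(750)).show()
// +---+-----+---+
// |key|value|  D|
// +---+-----+---+
// |  a|    1|750|
// |  b|    2|750|
// +---+-----+---+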