Components
The following are some important components of Spark:
- Cluster Manager
- Is used to run the Spark Application in Cluster Mode
- Application
- User program built on Spark. Consists of:
- Driver Program
- The Program that has SparkContext. Acts as a coordinator for the Application
- Executors
- Runs computation & Stores Application Data
- Are launched at the beginning of an Application & run for its entire lifetime
- Each Application gets its own Executors
- An Application can have multiple Executors
- An Executor is not shared by Multiple Applications
- Provides in-memory storage for RDDs
- On the standalone cluster manager, an Application runs at most one Executor per Node by default; other cluster managers (e.g. YARN) can place multiple Executors of the same Application on one Node
- Task
- Represents a unit of work in Spark
- Gets executed in an Executor
- Job
- A parallel computation consisting of multiple Tasks that gets spawned in response to a Spark action.
SparkConf()
Configuration for Spark application
SparkContext()
Main entry point of Spark functionality.
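A minimal sketch of how these pieces fit together (the app name and the local master are assumptions for illustration, not required values):
import org.apache.spark.{SparkConf, SparkContext}
// Configuration for the application
val conf = new SparkConf().setAppName("ComponentsDemo").setMaster("local[*]")
// Creating the SparkContext starts the driver side of the application
val sc = new SparkContext(conf)
// map is a transformation; count() is an action that launches a job made of tasks (one per partition)
val doubled = sc.parallelize(1 to 100).map(_ * 2)
println(doubled.count())  // 100
sc.stop()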
Shuffling is the process of data transfer between stages.
Tip
Avoid shuffling at all costs. Think about ways to leverage existing partitions. Leverage partial aggregation to reduce data transfer.
By default, shuffling doesn’t change the number of partitions, but their content.
Avoid groupByKey and use reduceByKey or combineByKey instead.
groupByKey shuffles all the data, which is slow.
reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
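A small sketch of the difference (assuming an existing SparkContext named sc):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))
// groupByKey: every (key, value) pair is shuffled before any aggregation happens
val viaGroup = pairs.groupByKey().mapValues(_.sum)
// reduceByKey: values are pre-summed within each partition, so only partial sums are shuffled
val viaReduce = pairs.reduceByKey(_ + _)
viaGroup.collect()   // e.g. Array((a,3), (b,2))
viaReduce.collect()  // same result, less data moved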
The number of tasks in a stage equals the number of partitions of the RDD being processed.
The number of executors is configured separately; each executor can run many tasks over the life of the application.
A stage is a series of transformations that can run without moving data between partitions.
A new stage is created at every shuffle boundary.
The NodeManager allocates resources on its own node (in YARN deployments).
The Driver and Executors are logical processes that get scheduled onto physical nodes.
An HDFS block is a physical division of data on disk, whereas a Spark partition is a logical division of the dataset.
The ResourceManager accepts the submitted job and allocates resources across the overall cluster (in YARN deployments).
RDD, DataFrame, and Dataset are distributed data abstractions: their contents are split into partitions spread across the cluster,
whereas a plain Scala object lives in a single JVM and is not distributed by itself.
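A quick sketch of the distinction (sc assumed to be an existing SparkContext):
val localList = (1 to 1000).toList        // an ordinary Scala object, held in one JVM
val rdd = sc.parallelize(localList, 4)    // an RDD: the same data split into 4 partitions across executors
println(rdd.getNumPartitions)             // 4
println(rdd.map(_ + 1).sum())             // computed in parallel on the executors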
There are two kinds of transformations in Spark:
- Narrow transformations
- Wide transformations
Narrow transformations:
Narrow transformations are the result of operations such as map and filter, in which the data to be transformed
comes from a single partition only, i.e. each partition is self-contained.
An output RDD has partitions with records that originate from a
single partition in the parent RDD.
Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey.
The data required to compute the records in a single partition may
reside in many partitions of the parent RDD.
Wide transformations are also called shuffle transformations because they usually require a shuffle (Spark can avoid it only when the data is already partitioned appropriately).
All of the tuples with the same key must end up in the same partition, processed by the same task.
To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster
and results in a new stage with a new set of partitions.
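A sketch of both kinds in one lineage (sc assumed to be an existing SparkContext; "input.txt" is a hypothetical path):
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))   // narrow: each output partition comes from one parent partition
  .map(w => (w, 1))           // narrow
  .reduceByKey(_ + _)         // wide: shuffle boundary, new stage with a new set of partitions
println(counts.toDebugString) // the lineage; indentation marks the stage created by the shuffle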
---------------------------------------------------------------------------------------------------------------------
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:
- hdfs-site.xml, which provides default behaviors for the HDFS client.
- core-site.xml, which sets the default filesystem name.
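With those two files on the classpath, HDFS paths can be read and written directly; a minimal sketch (sc assumed to be an existing SparkContext; hostname, port and paths are hypothetical):
val logs = sc.textFile("hdfs://namenode:8020/data/input/logs.txt")
val errors = logs.filter(_.contains("ERROR"))
errors.saveAsTextFile("hdfs://namenode:8020/data/output/errors")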
----------------------------------------------------------------------------------------------------------------------------------------
Properties
spark.driver.memory (default: 1g)
- Amount of memory to use for the driver process, i.e. where SparkContext is initialized (e.g. 1g, 2g).
- Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-memory command line option or in your default properties file.
spark.executor.memory (default: 1g)
- Amount of memory to use per executor process (e.g. 2g, 8g).
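A hedged sketch of setting executor memory programmatically (the app name and the 2g value are illustrative; as noted above, driver memory in client mode must go through --driver-memory or the properties file instead):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("MemoryConfigDemo")
  .set("spark.executor.memory", "2g")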
The lit function is for adding literal values as a column:
import org.apache.spark.sql.functions._
df.withColumn("D", lit(750))