Components
The following are some important components of Spark:
- Cluster Manager
- Is used to run the Spark Application in Cluster Mode
- Application
- User program built on Spark. Consists of:
- Driver Program
- The Program that has SparkContext. Acts as a coordinator for the Application
- Executors
- Runs computation & Stores Application Data
- Are launched at the beginning of an Application & run for its entire lifetime
- Each Application gets its own Executors
- An Application can have multiple Executors
- An Executor is not shared by Multiple Applications
- Provides in-memory storage for RDDs
- On the standalone cluster manager, an Application runs at most one Executor per Node by default; other cluster managers (e.g. YARN) can place multiple Executors of the same Application on one Node
- Task
- Represents a unit of work in Spark
- Gets executed in an Executor
- Job
- A parallel computation consisting of multiple Tasks that gets spawned in response to a Spark action.
SparkConf()
Configuration for Spark application
SparkContext()
Main entry point of Spark functionality.
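A minimal sketch of how these pieces fit together (the app name and the local master are assumptions for illustration, not required values):
import org.apache.spark.{SparkConf, SparkContext}
// Configuration for the application
val conf = new SparkConf().setAppName("ComponentsDemo").setMaster("local[*]")
// Creating the SparkContext starts the driver side of the application
val sc = new SparkContext(conf)
// map is a transformation; count() is an action that launches a job made of tasks (one per partition)
val doubled = sc.parallelize(1 to 100).map(_ * 2)
println(doubled.count())  // 100
sc.stop()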
Shuffling is the process of data transfer between stages.
Tip
Avoid shuffling at all costs. Think about ways to leverage existing partitions. Leverage partial aggregation to reduce data transfer.
By default, shuffling doesn’t change the number of partitions, but their content.
Avoid groupByKey and use reduceByKey or combineByKey instead.
groupByKey shuffles all the data, which is slow.
reduceByKey shuffles only the results of sub-aggregations in each partition of the data.
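A small sketch of the difference (assuming an existing SparkContext named sc):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))
// groupByKey: every (key, value) pair is shuffled before any aggregation happens
val viaGroup = pairs.groupByKey().mapValues(_.sum)
// reduceByKey: values are pre-summed within each partition, so only partial sums are shuffled
val viaReduce = pairs.reduceByKey(_ + _)
viaGroup.collect()   // e.g. Array((a,3), (b,2))
viaReduce.collect()  // same result, less data moved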
The number of tasks in a stage equals the number of partitions of the RDD being processed.
The number of executors is configured separately; each executor can run many tasks over the life of the application.
A stage is a series of transformations that can run without moving data between partitions.
A new stage is created at every shuffle boundary.
The NodeManager allocates resources on its own node (in YARN deployments).
The Driver and Executors are logical processes that get scheduled onto physical nodes.
An HDFS block is a physical division of data on disk, whereas a Spark partition is a logical division of the dataset.
The ResourceManager accepts the submitted job and allocates resources across the overall cluster (in YARN deployments).
RDD, DataFrame, and Dataset are distributed data abstractions: their contents are split into partitions spread across the cluster,
whereas a plain Scala object lives in a single JVM and is not distributed by itself.
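A quick sketch of the distinction (sc assumed to be an existing SparkContext):
val localList = (1 to 1000).toList        // an ordinary Scala object, held in one JVM
val rdd = sc.parallelize(localList, 4)    // an RDD: the same data split into 4 partitions across executors
println(rdd.getNumPartitions)             // 4
println(rdd.map(_ + 1).sum())             // computed in parallel on the executors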
There are two kinds of transformations in Spark:
- Narrow transformations
- Wide transformations
Narrow transformations:
Narrow transformations are the result of operations such as map and filter, in which the data to be transformed
comes from a single partition only, i.e. each partition is self-contained.
An output RDD has partitions with records that originate from a
single partition in the parent RDD.
Wide Transformations
Wide transformations are the result of groupByKey and reduceByKey.
The data required to compute the records in a single partition may
reside in many partitions of the parent RDD.
Wide transformations are also called shuffle transformations because they usually require a shuffle (Spark can avoid it only when the data is already partitioned appropriately).
All of the tuples with the same key must end up in the same partition, processed by the same task.
To satisfy these operations, Spark must execute an RDD shuffle, which transfers data across the cluster
and results in a new stage with a new set of partitions.
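A sketch of both kinds in one lineage (sc assumed to be an existing SparkContext; "input.txt" is a hypothetical path):
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))   // narrow: each output partition comes from one parent partition
  .map(w => (w, 1))           // narrow
  .reduceByKey(_ + _)         // wide: shuffle boundary, new stage with a new set of partitions
println(counts.toDebugString) // the lineage; indentation marks the stage created by the shuffle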
---------------------------------------------------------------------------------------------------------------------
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:
- hdfs-site.xml, which provides default behaviors for the HDFS client.
- core-site.xml, which sets the default filesystem name.
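With those two files on the classpath, HDFS paths can be read and written directly; a minimal sketch (sc assumed to be an existing SparkContext; hostname, port and paths are hypothetical):
val logs = sc.textFile("hdfs://namenode:8020/data/input/logs.txt")
val errors = logs.filter(_.contains("ERROR"))
errors.saveAsTextFile("hdfs://namenode:8020/data/output/errors")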
----------------------------------------------------------------------------------------------------------------------------------------
Properties
spark.driver.memory (default: 1g)
- Amount of memory to use for the driver process, i.e. where SparkContext is initialized (e.g. 1g, 2g).
- Note: In client mode, this config must not be set through the SparkConf directly in your application, because the driver JVM has already started at that point. Instead, set it through the --driver-memory command line option or in your default properties file.
spark.executor.memory (default: 1g)
- Amount of memory to use per executor process (e.g. 2g, 8g).
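A hedged sketch of setting executor memory programmatically (the app name and the 2g value are illustrative; as noted above, driver memory in client mode must go through --driver-memory or the properties file instead):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .setAppName("MemoryConfigDemo")
  .set("spark.executor.memory", "2g")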
The lit function is for adding literal values as a column:
import org.apache.spark.sql.functions._
df.withColumn("D", lit(750))