Tuesday, 23 April 2019

SparkContext vs SparkSession


import org.apache.spark.sql.SparkSession

object MultipleSparkSessions {
 def main(args: Array[String]): Unit = {
  val sparksession1 = SparkSession.builder()
   .master("local")
   .appName("create multiple spark sessions")
   .getOrCreate()

  // Calling builder().getOrCreate() again would simply return the session above,
  // so newSession() is used here to get a second, separate SparkSession that
  // shares the same underlying SparkContext.
  val sparksession2 = sparksession1.newSession()

  val rdd1 = sparksession1.sparkContext.parallelize(Array(1, 2, 3, 4, 5))
  val rdd2 = sparksession2.sparkContext.parallelize(Array(100, 101))

  rdd1.collect().foreach(println)

  rdd2.collect().foreach(println)
 }
}

Output
=====
1
2
3
4
5

100
101

We can create more than one SparkSession in a single job, and the example above verifies that the RDDs are created properly in each session.



In Spark 1.x we cannot create more than one session; in Spark 2.x we can create more than one
SparkSession in a single job.



Spark 2.x bundled SparkContext, SQLContext, and HiveContext into a single
entry point, which can be accessed through SparkSession.
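
As a rough sketch of that single entry point (assuming a local master; the Hive part additionally needs the spark-hive dependency, so it is left commented out):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("unified entry point")
  // .enableHiveSupport()   // folds in what HiveContext used to provide; needs spark-hive on the classpath
  .getOrCreate()

// The older handles are still reachable through the one session
val sc         = spark.sparkContext   // SparkContext
val sqlContext = spark.sqlContext     // SQLContext, kept for backward compatibility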

We have a driver, and as part of the driver we have a SparkContext. Any command
we want to execute has to go through the SparkContext; it takes care of
distributing the work and executing it through the executors.

Imagine a scenario where multiple users want to use the cluster
and run their queries on top of it. How will you handle it?
Every user can have their own SparkSession, but there will be only one SparkContext.

Every user can have their own SparkSession, set their own properties, and have their own configuration on that SparkSession. They can also have their own tables: whatever temporary tables they create as part of Spark SQL are their own copy, visible only within that SparkSession and not to other users.

The SparkContext represents the application; a SparkSession represents a user's session running through that SparkContext.
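
A minimal sketch of that isolation (the view name "orders", the sample rows, and the property values are made up for illustration): each session gets its own run-time properties and its own temporary views, while both run through one SparkContext.

import org.apache.spark.sql.SparkSession

val user1 = SparkSession.builder()
  .master("local")
  .appName("per-user sessions")
  .getOrCreate()
val user2 = user1.newSession()                       // separate session, same SparkContext

// Each user sets their own properties on their own session
user1.conf.set("spark.sql.shuffle.partitions", "4")
user2.conf.set("spark.sql.shuffle.partitions", "16")

// A temporary view created in user1's session is not visible in user2's session
import user1.implicits._
Seq((1, "a"), (2, "b")).toDF("id", "name").createOrReplaceTempView("orders")

println(user1.catalog.tableExists("orders"))         // true
println(user2.catalog.tableExists("orders"))         // false

// Both sessions are backed by the same SparkContext
println(user1.sparkContext eq user2.sparkContext)    // true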


Why do I need a SparkSession?
       Every user wants their own set of properties
       and their own set of tables.


The cluster is shared from a resources point of view.


Spark Context:
Prior to Spark 2.0.0, SparkContext was used as the channel to access all Spark functionality.
The Spark driver program uses the SparkContext to connect to the cluster through a resource manager (YARN, Mesos, etc.).
A SparkConf object is required to create the SparkContext; it stores configuration parameters such as appName (to identify your Spark driver), the number of cores, and the memory size of the executors running on the worker nodes.

In order to use the SQL, Hive, and Streaming APIs, separate contexts need to be created.

Example:
Creating the SparkConf:

val conf = new SparkConf().setAppName("RetailDataAnalysis").setMaster("spark://master:7077").set("spark.executor.memory", "2g")

Creating the SparkContext:
val sc = new SparkContext(conf)
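
For comparison, a rough Spark 1.x-style sketch of the separate contexts that used to be needed for SQL, Hive, and Streaming (these classes are deprecated in 2.x; the Hive and Streaming ones need the spark-hive and spark-streaming dependencies):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One context per API family, all built on the same sc created above
val sqlContext       = new SQLContext(sc)                    // Spark SQL
val hiveContext      = new HiveContext(sc)                   // Hive queries
val streamingContext = new StreamingContext(sc, Seconds(10)) // Spark Streaming (10-second batches)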
Spark Session:

From Spark 2.0.0 onwards, SparkSession provides a single point of entry to interact with the underlying Spark functionality and
allows programming Spark with the DataFrame and Dataset APIs. All the functionality available with SparkContext is also available in SparkSession.

In order to use the SQL, Hive, and Streaming APIs, there is no need to create separate contexts, as SparkSession includes all of those APIs.

Once the SparkSession is instantiated, we can configure Spark’s run-time config properties.

Example:

Creating the SparkSession:
val spark = SparkSession
.builder
.appName("WorldBankIndex")
.getOrCreate()

Configuring properties:
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")

From Spark 2.0.0 onwards, it is better to use SparkSession, as it provides access to all the functionality that SparkContext does and also provides the APIs to work with DataFrames and Datasets.
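
A short sketch of what that looks like in practice (the Person case class, sample rows, and view name here are made up for illustration): a single SparkSession handle covers the RDD, DataFrame/Dataset, and SQL APIs.

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .master("local")
  .appName("WorldBankIndex")
  .getOrCreate()
import spark.implicits._

// RDD API through the bundled SparkContext
val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 16)))

// Dataset / DataFrame APIs straight from the session
val ds     = rdd.toDS()              // Dataset[Person]
val adults = ds.filter(_.age >= 18)  // still a typed Dataset[Person]

// Spark SQL without a separate SQLContext
adults.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people").show()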







