import org.apache.spark.sql.SparkSession

object MultipleSparkSessions {
  def main(args: Array[String]): Unit = {
    val sparksession1 = SparkSession.builder()
      .master("local")
      .appName("create multiple spark sessions")
      .getOrCreate()
    // Calling getOrCreate() again would return the already active session,
    // so derive a second, independent session from the first one instead;
    // both sessions share the same underlying SparkContext.
    val sparksession2 = sparksession1.newSession()
    val rdd1 = sparksession1.sparkContext.parallelize(Array(1, 2, 3, 4, 5))
    val rdd2 = sparksession2.sparkContext.parallelize(Array(100, 101))
    rdd1.collect.foreach(println)
    rdd2.collect.foreach(println)
  }
}
Output
=====
1
2
3
4
5
100
101
We can create more than one SparkSession in a single job, and the output above shows that the RDDs from both sessions are created properly.
In Spark 1.x we could not create more than one session; in Spark 2.x we can create more than one
SparkSession in a single job.
Spark 2.x bundled the SparkContext, the SQLContext and the HiveContext into a single
entry point that can be accessed through the SparkSession.
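As a quick sketch of that bundling (a minimal example; the app name is just a placeholder, and enableHiveSupport() is shown commented out because it requires the spark-hive dependency), the old contexts are all reachable from one SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("unified entry point")
  // .enableHiveSupport()  // replaces the old HiveContext; needs the spark-hive dependency
  .getOrCreate()

val sc  = spark.sparkContext          // the bundled SparkContext
val sql = spark.sqlContext            // the bundled SQLContext, kept for backward compatibility
spark.sql("SELECT 1 AS one").show()   // Spark SQL directly through the session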
We have a driver, and as part of the driver we have a SparkContext. Any command
I want to execute has to be passed to the SparkContext, and it takes care of
executing it through the executors.
Imagine a scenario where multiple users want to use the cluster and
run their queries on top of it. How will you handle it?
Every user can have their own SparkSession, but there will be only one SparkContext.
Each user can set their own properties and configuration on their SparkSession and can also have their own tables: whatever tables they create through Spark SQL are their own copy and are visible only within that SparkSession, not to other users.
The SparkContext represents the application; a SparkSession represents a user's session running through that SparkContext.
Why do I need SparkSession?
Every user wants his own set of properties and his own set of tables.
The cluster is shared from a resources point of view.
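A hedged sketch of this per-user isolation, assuming two users share one application in the same JVM (the table name my_table is made up): spark.newSession() gives each user a session that shares the single SparkContext but keeps its own run-time config and its own temporary views.

import org.apache.spark.sql.SparkSession

val base = SparkSession.builder()
  .master("local")
  .appName("per-user sessions")
  .getOrCreate()

// Each "user" gets an independent session on top of the same SparkContext.
val userA = base.newSession()
val userB = base.newSession()
assert(userA.sparkContext eq userB.sparkContext)  // only one SparkContext for the application

// Per-session properties do not leak between users.
userA.conf.set("spark.sql.shuffle.partitions", "4")
userB.conf.set("spark.sql.shuffle.partitions", "16")

// Temporary tables are visible only inside the session that created them.
import userA.implicits._
Seq(1, 2, 3).toDF("id").createOrReplaceTempView("my_table")
userA.sql("SELECT count(*) FROM my_table").show()  // works
// userB.sql("SELECT count(*) FROM my_table")       // would fail: table or view not found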
Spark Context:
Prior to Spark 2.0.0, SparkContext was used as the channel to access all Spark functionality.
The Spark driver program uses the SparkContext to connect to the cluster through a resource manager (YARN, Mesos, etc.).
A SparkConf is required to create the SparkContext object; it stores configuration parameters such as the appName (to identify your Spark driver), the number of cores, and the memory size of the executors running on the worker nodes.
In order to use the SQL, Hive, and Streaming APIs, separate contexts need to be created (sketched after the example below).
Example:
Creating the SparkConf:
val conf = new SparkConf().setAppName("RetailDataAnalysis").setMaster("spark://master:7077").set("spark.executor.memory", "2g")
Creating the SparkContext:
val sc = new SparkContext(conf)
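To illustrate the separate contexts, here is a rough Spark 1.x style sketch (these constructors are deprecated in 2.x, and HiveContext requires the spark-hive dependency): each API family needed its own context built on top of the SparkContext.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RetailDataAnalysis").setMaster("local[*]")
val sc   = new SparkContext(conf)

val sqlContext  = new SQLContext(sc)                      // Spark SQL
val hiveContext = new HiveContext(sc)                     // Hive support
val ssc         = new StreamingContext(sc, Seconds(10))   // Streaming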
Spark Session:
From Spark 2.0.0 onwards, SparkSession provides a single point of entry to interact with the underlying Spark functionality and
allows programming Spark with the DataFrame and Dataset APIs. All the functionality available through SparkContext is also available through SparkSession.
In order to use the SQL, Hive, and Streaming APIs, there is no need to create separate contexts, as SparkSession includes all of these APIs.
Once the SparkSession is instantiated, we can configure Spark’s run-time config properties.
Example:
Creating the SparkSession:
val spark = SparkSession
  .builder
  .appName("WorldBankIndex")
  .getOrCreate()
Configuring properties:
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
From Spark 2.0.0 onwards, it is better to use SparkSession, as it provides access to all the functionality that SparkContext does and also provides the APIs to work with DataFrames and Datasets.
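To close, a small hedged sketch of those unified APIs on one SparkSession (the sample rows and column names below are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("UnifiedAPIs")
  .getOrCreate()
import spark.implicits._

// DataFrame API
val people = Seq(("Ann", 34), ("Bob", 29)).toDF("name", "age")
people.createOrReplaceTempView("people")

// SQL through the same session, no separate SQLContext required
spark.sql("SELECT name FROM people WHERE age > 30").show()

// Dataset API through the same session
val ds = people.as[(String, Int)]
ds.filter(_._2 > 30).show()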