SparkSession has been the main entry point to Spark applications since Spark 2.0. Before Spark 2.0, SparkContext was the main entry point for any Spark application. Let's see how to start a SparkSession in a Spark application, along with some of its important properties and methods.
SparkSession provides a single point of entry for interacting with the underlying Spark APIs and functions. It lets a Spark application work with the DataFrame and Dataset APIs, and it also exposes all the functionality available through SparkContext. A SparkSession is created using the SparkSession.builder() builder pattern.
If we want to use the SQL, Hive, and Streaming APIs, we don't need to create separate contexts; a single SparkSession gives access to all of them.
We can also set Spark's various configuration properties while building the SparkSession, using the config() method, as the sketch below shows.
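As a quick illustration, here is a minimal Scala sketch of one session serving several of these APIs; the rate format is Spark's built-in streaming test source, and the config() property value chosen here is only illustrative:

import org.apache.spark.sql.SparkSession

// One SparkSession covers SQL, Structured Streaming, and the
// underlying SparkContext; no separate context objects are needed.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("UnifiedEntryPoint")
  .config("spark.sql.shuffle.partitions", "4") // illustrative config property
  .getOrCreate()

spark.sql("SELECT 1 AS id").show() // SQL API
val stream = spark.readStream
  .format("rate")                  // built-in streaming test source
  .load()                          // Structured Streaming API
val sc = spark.sparkContext        // the legacy SparkContext is still available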
Creating a SparkSession in Scala and Java
To create a SparkSession in Scala, Java, or Python, we use the builder() method, set the master and the application name along with any other properties, and then call getOrCreate(). The getOrCreate() method returns the already existing SparkSession; if no SparkSession exists yet, it creates a new one.
Using Scala
Let’s see an example of SparkSession using Scala.
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .config("spark.executor.memory", "2g")
  .appName("SparkExample")
  .master("local[2]")
  .getOrCreate()
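As a quick check of the getOrCreate() behavior described above, this small sketch asks the builder for a session a second time; because one already exists in this JVM, the same instance is returned:

// A second builder call returns the session created above.
val sameSession = SparkSession.builder().getOrCreate()
assert(sameSession eq sparkSession) // same instance, not a new session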
Using Java
Let’s see an example of SparkSession using Java.
import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
    .config("spark.executor.memory", "2g")
    .appName("SparkExample")
    .master("local[2]")
    .getOrCreate();
Important SparkSession methods
Let's look at some important and commonly used SparkSession methods; a short sketch exercising a few of them follows the table.
| Method | Description |
| --- | --- |
| version | Returns the Spark version the application is running on, for example 2.4.0. |
| createDataFrame() | Creates a DataFrame from a local collection or an RDD. |
| emptyDataFrame() | Creates an empty DataFrame. |
| emptyDataset() | Creates an empty Dataset of type T. |
| createDataset() | Creates a Dataset from a local collection or an RDD. |
| readStream() | Returns an instance of DataStreamReader, which is used to read streaming data. |
| getActiveSession() | getActiveSession: Option[SparkSession] returns the currently active SparkSession, if any. |
| read() | Returns an instance of DataFrameReader, which is used to read records from external data sources such as CSV, Parquet, or Avro. |
| udf() | Registers a user-defined function (UDF). |
| sql() | sql(sqlText: String): DataFrame executes a SQL query and returns the result as a DataFrame. |
| sqlContext() | sqlContext: SQLContext gives access to the underlying SQLContext. |
| sparkContext() | Returns the underlying SparkContext. |
| implicits() | import spark.implicits._ enables implicit conversions from Scala objects to Dataset, DataFrame, Column, etc. |
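The following minimal Scala sketch exercises a few of these methods together; the column names and sample rows are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SessionMethodsExample")
  .master("local[2]")
  .getOrCreate()

// implicits enables .toDF/.toDS on local Scala collections
import spark.implicits._

println(spark.version) // e.g. 2.4.0

// Build a DataFrame from a local collection and query it with sql()
val people = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()

// read() returns a DataFrameReader for external sources
// (the path below is hypothetical)
// val csv = spark.read.option("header", "true").csv("/tmp/people.csv")

spark.stop()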