SparkSession has been the main entry point to Spark applications since Spark 2.0. Before Spark 2.0, SparkContext was the main entry point for any Spark application. Let's see how to start a SparkSession in a Spark application, along with some of its important properties and methods.
SparkSession provides a single point of entry for interacting with the underlying Spark APIs and functions. It lets a Spark application work with the DataFrame and Dataset APIs, and it also exposes all the functionality available through SparkContext. A SparkSession is created using the SparkSession.builder() builder pattern.
If we want to use the SQL, Hive, and Streaming APIs, we don't need to create separate contexts; a single SparkSession gives access to all of them.
We can also set Spark's various configuration properties while building the SparkSession, using the config() method, as the sketch below shows.
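As a quick illustration, here is a minimal Scala sketch of one session serving several of these APIs; the rate format is Spark's built-in streaming test source, and the config() property value chosen here is only illustrative:

import org.apache.spark.sql.SparkSession

// One SparkSession covers SQL, Structured Streaming, and the
// underlying SparkContext; no separate context objects are needed.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("UnifiedEntryPoint")
  .config("spark.sql.shuffle.partitions", "4") // illustrative config property
  .getOrCreate()

spark.sql("SELECT 1 AS id").show() // SQL API
val stream = spark.readStream
  .format("rate")                  // built-in streaming test source
  .load()                          // Structured Streaming API
val sc = spark.sparkContext        // the legacy SparkContext is still available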
Creating a SparkSession in Scala and Java
To create a SparkSession in Scala, Java, or Python, we use the builder() method, set the master and the application name along with any other properties, and then call getOrCreate(). The getOrCreate() method returns the already existing SparkSession; if no SparkSession exists yet, it creates a new one.
Using Scala
Let’s see an example of SparkSession using Scala.
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .config("spark.executor.memory", "2g")
  .appName("SparkExample")
  .master("local[2]")
  .getOrCreate()
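As a quick check of the getOrCreate() behavior described above, this small sketch asks the builder for a session a second time; because one already exists in this JVM, the same instance is returned:

// A second builder call returns the session created above.
val sameSession = SparkSession.builder().getOrCreate()
assert(sameSession eq sparkSession) // same instance, not a new session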
Using Java
Let’s see an example of SparkSession using Java.
import org.apache.spark.sql.SparkSession;

SparkSession sparkSession = SparkSession.builder()
    .config("spark.executor.memory", "2g")
    .appName("SparkExample")
    .master("local[2]")
    .getOrCreate();
Important SparkSession methods
Let's look at some important and commonly used SparkSession methods; a short sketch exercising a few of them follows the table.
| Method | Description |
| --- | --- |
| version | Returns the Spark version the application is running on, for example 2.4.0. |
| createDataFrame() | Creates a DataFrame from a local collection or an RDD. |
| emptyDataFrame() | Creates an empty DataFrame. |
| emptyDataset() | Creates an empty Dataset of type T. |
| createDataset() | Creates a Dataset from a local collection or an RDD. |
| readStream() | Returns an instance of DataStreamReader, which is used to read streaming data. |
| getActiveSession() | getActiveSession: Option[SparkSession] returns the currently active SparkSession, if any. |
| read() | Returns an instance of DataFrameReader, which is used to read records from external data sources such as CSV, Parquet, or Avro. |
| udf() | Registers a user-defined function (UDF). |
| sql() | sql(sqlText: String): DataFrame executes a SQL query and returns the result as a DataFrame. |
| sqlContext() | sqlContext: SQLContext gives access to the underlying SQLContext. |
| sparkContext() | Returns the underlying SparkContext. |
| implicits() | import spark.implicits._ enables implicit conversions from Scala objects to Dataset, DataFrame, Column, etc. |
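The following minimal Scala sketch exercises a few of these methods together; the column names and sample rows are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SessionMethodsExample")
  .master("local[2]")
  .getOrCreate()

// implicits enables .toDF/.toDS on local Scala collections
import spark.implicits._

println(spark.version) // e.g. 2.4.0

// Build a DataFrame from a local collection and query it with sql()
val people = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()

// read() returns a DataFrameReader for external sources
// (the path below is hypothetical)
// val csv = spark.read.option("header", "true").csv("/tmp/people.csv")

spark.stop()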