To start an Apache Spark application, we need to create an entry point using a Spark session, configure Spark application properties, and then define the data processing logic.
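Since Spark 2.0, the usual entry point is a Spark session. A minimal sketch of creating one (the application name and master setting are chosen for illustration):
import org.apache.spark.sql.SparkSession
// Create a SparkSession - the unified entry point since Spark 2.0
val spark = SparkSession.builder()
  .appName("Spark Notes") // Application name (illustrative)
  .master("local[*]")     // Local mode execution
  .getOrCreate()
// The underlying SparkContext is still available if needed
val sc = spark.sparkContext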
Spark Context
SparkContext was the main entry point for Apache Spark before Spark 2.0. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
We can only have one SparkContext active per JVM (Java Virtual Machine). We need to stop the active SparkContext before creating a new one.
Spark Context Creation Example
import org.apache.spark.{SparkConf, SparkContext}
// Create the Spark configuration
val conf = new SparkConf()
  .setAppName("Spark Notes")
  .setMaster("local[*]") // Local mode execution
// Create the Spark Context from the configuration
val sparkContext = new SparkContext(conf)
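Because only one SparkContext can be active per JVM, an existing context must be stopped before another one is created. A minimal sketch building on the object above (the new variable name is illustrative):
sparkContext.stop() // Stop the active context before creating a new one
val newContext = new SparkContext(conf) // Reuse the same configuration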
Parallelized Collections
Parallelized collections are created by calling Spark Context’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Example:
val arrayData = Array(4, 6, 9, 11)
val parallelizedData = sparkContext.parallelize(arrayData) // sparkContext is the SparkContext object
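Once the collection is distributed, it can be operated on in parallel. A short sketch using the standard RDD actions reduce and collect (the variable names are illustrative):
val total = parallelizedData.reduce(_ + _)    // Sum the elements in parallel -> 30
val localCopy = parallelizedData.collect()    // Bring the distributed data back to the driver as an Array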