To start an Apache Spark application, we need to create an entry point using a Spark session, configure Spark application properties, and then define the data processing logic.
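Since Spark 2.0, the usual entry point is a Spark session. A minimal sketch of creating one (the application name and master setting are chosen for illustration):
import org.apache.spark.sql.SparkSession
// Create a SparkSession - the unified entry point since Spark 2.0
val spark = SparkSession.builder()
  .appName("Spark Notes") // Application name (illustrative)
  .master("local[*]")     // Local mode execution
  .getOrCreate()
// The underlying SparkContext is still available if needed
val sc = spark.sparkContext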
Spark Context
SparkContext was the main entry point for Apache Spark before Spark 2.0. It represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
We can only have one SparkContext active per JVM (Java Virtual Machine). We need to stop the active SparkContext before creating a new one.
Spark Context Creation Example
import org.apache.spark.{SparkConf, SparkContext}
// Create the Spark configuration
val conf = new SparkConf()
  .setAppName("Spark Notes")
  .setMaster("local[*]") // Local mode execution
// Create the Spark Context from the configuration
val sparkContext = new SparkContext(conf)
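Because only one SparkContext can be active per JVM, an existing context must be stopped before another one is created. A minimal sketch building on the object above (the new variable name is illustrative):
sparkContext.stop() // Stop the active context before creating a new one
val newContext = new SparkContext(conf) // Reuse the same configuration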
Parallelized Collections
Parallelized collections are created by calling Spark Context’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel.
Example:
val arrayData = Array(4, 6, 9, 11)
val parallelizedData = sparkContext.parallelize(arrayData) // sparkContext is the SparkContext object
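Once the collection is distributed, it can be operated on in parallel. A short sketch using the standard RDD actions reduce and collect (the variable names are illustrative):
val total = parallelizedData.reduce(_ + _)    // Sum the elements in parallel -> 30
val localCopy = parallelizedData.collect()    // Bring the distributed data back to the driver as an Array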