Starting an Apache Spark Application
To start an Apache Spark application, we need to create an entry point using a Spark session, configure Spark application properties, and then define the data processing logic. Spark Context…
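As a minimal Scala sketch of such an entry point (the application name and property values here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object StarterApp {
  def main(args: Array[String]): Unit = {
    // Create the entry point and configure application properties.
    val spark = SparkSession.builder()
      .appName("StarterApp")                       // illustrative name
      .master("local[*]")                          // run locally for this sketch
      .config("spark.sql.shuffle.partitions", "8") // an example property
      .getOrCreate()

    // Define the data processing logic.
    val squares = spark.range(1, 11).selectExpr("id", "id * id AS square")
    squares.show()

    spark.stop()
  }
}
```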
Parallelism is the ability to perform multiple tasks simultaneously by slicing the data into smaller partitions and processing them in parallel across multiple nodes in a cluster. Apache Spark…
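A hedged sketch, runnable in spark-shell where `spark` is predefined (the partition count of 8 is arbitrary):

```scala
// Slice a dataset into 8 partitions so tasks can run in parallel.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
println(rdd.getNumPartitions)        // 8

// The same idea with the DataFrame API.
val df = spark.range(1000000).repartition(8)
println(df.rdd.getNumPartitions)     // 8
```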
A cluster manager is an external service through which Spark jobs are submitted and which acquires resources for them on the cluster. Spark applications are independent…
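To illustrate, the master URL tells Spark which cluster manager to acquire resources from; the host names below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // Pick one master URL depending on the cluster manager:
  //   "local[*]"               - no cluster manager, run in-process
  //   "spark://host:7077"      - Spark standalone master (placeholder host)
  //   "yarn"                   - Hadoop YARN
  //   "k8s://https://host:443" - Kubernetes (placeholder endpoint)
  .master("local[*]")
  .getOrCreate()
```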
Apache Spark is an open-source distributed framework, optimized for both in-memory and disk-based processing, which performs real-time analytics using Resilient Distributed Datasets (RDDs). It includes a streaming library, and a rich set of…
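A small sketch of the RDD abstraction, runnable in spark-shell where `spark` is predefined:

```scala
// Build an RDD, transform it in memory, and collect the result.
val words = spark.sparkContext.parallelize(Seq("spark", "rdd", "spark", "analytics"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)    // (spark,2), (rdd,1), (analytics,1)
```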
SparkSession has been the main entry point to Spark applications since Spark 2.0. Before Spark 2.0, SparkContext was the main entry point for any Spark application. We see how…
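The contrast, sketched in Scala (application names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Before Spark 2.0: SparkContext as the entry point.
val conf = new SparkConf().setAppName("LegacyApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// Since Spark 2.0: SparkSession wraps SparkContext (and SQLContext).
// getOrCreate() reuses the SparkContext created above rather than
// starting a second one.
val spark = SparkSession.builder().appName("ModernApp").getOrCreate()
val sameSc = spark.sparkContext   // the underlying SparkContext is still there
```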
Caching and persistence are optimization techniques for iterative and interactive Apache Spark computations. They save interim partial results, which can be reused in subsequent stages. These results,…
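A short sketch, runnable in spark-shell where `spark` is predefined (the DataFrame is synthetic):

```scala
import org.apache.spark.storage.StorageLevel

val df = spark.range(1, 1000000).selectExpr("id", "id % 10 AS bucket")

df.cache()                           // MEMORY_AND_DISK by default for DataFrames
df.count()                           // first action materializes the cache
df.groupBy("bucket").count().show()  // subsequent stages reuse the cached partitions

val other = spark.range(100).persist(StorageLevel.MEMORY_ONLY) // explicit level
df.unpersist()                       // release storage when finished
```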
Data locality refers to how close data is to the code processing it. Having the code and the data together tends to make computations faster in Apache Spark. If the…
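One related knob is `spark.locality.wait`, which controls how long the scheduler holds out for a data-local slot. A hedged config sketch (the value shown is the default):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalityDemo")
  .master("local[*]")
  // How long the scheduler waits for a data-local slot before
  // falling back to a less local level (node -> rack -> any).
  .config("spark.locality.wait", "3s")
  .getOrCreate()
```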
If you are switching from Hortonworks Data Platform (HDP) 2.6 to 3.0+, you will have a hard time accessing Hive tables through the Apache Spark shell. HDP 3 introduced something called…
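For reference, this is the classic route for reaching Hive tables from a Spark application; on HDP 3+ this route alone may no longer see managed Hive tables, which is the migration pain described above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveAccess")
  .enableHiveSupport()    // wires Spark SQL to the Hive metastore
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```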
UDAF stands for User-Defined Aggregate Function. Aggregate functions perform a calculation on a set of values and return a single value. Writing an aggregate function is harder than writing a User-Defined Function (UDF), as we need to aggregate over multiple rows and columns. An Apache Spark UDAF operates on more than one row or column while returning a single-value result.
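As a sketch in the Spark 3 style, here is a user-defined average built on the `Aggregator` API (the registered name `my_average` is made up):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Intermediate buffer: running sum and count across rows.
case class AvgBuffer(sum: Double, count: Long)

object MyAverage extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, a: Double): AvgBuffer =
    AvgBuffer(b.sum + a, b.count + 1)                 // fold one row in
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)   // combine partitions
  def finish(b: AvgBuffer): Double = b.sum / b.count  // single-value result
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register for SQL use, e.g. SELECT my_average(salary) FROM employees:
// spark.udf.register("my_average", udaf(MyAverage))
```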
When working with Apache Spark, there are times when you need to trigger a Spark job on demand from within or outside the cluster. There are two ways to submit an Apache Spark job to a cluster: a bash script and the REST API.
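For the scripted route, Spark also ships a programmatic wrapper around spark-submit, `org.apache.spark.launcher.SparkLauncher`. A hedged sketch, where the jar path and class name are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitOnDemand {
  def main(args: Array[String]): Unit = {
    // Equivalent to invoking spark-submit from a bash script.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")     // placeholder jar
      .setMainClass("com.example.MyApp")         // placeholder main class
      .setMaster("yarn")                         // or a standalone master URL
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // Block until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Final state: ${handle.getState}")
  }
}
```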