Starting an Apache Spark Application
To start an Apache Spark application, we need to create an entry point using a Spark session, configure Spark application properties, and then define the data processing logic. Spark Context…
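As a minimal Scala sketch of such an entry point (the application name and property values here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object StarterApp {
  def main(args: Array[String]): Unit = {
    // Create the entry point and configure application properties.
    val spark = SparkSession.builder()
      .appName("StarterApp")                       // illustrative name
      .master("local[*]")                          // run locally for this sketch
      .config("spark.sql.shuffle.partitions", "8") // an example property
      .getOrCreate()

    // Define the data processing logic.
    val squares = spark.range(1, 11).selectExpr("id", "id * id AS square")
    squares.show()

    spark.stop()
  }
}
```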
Parallelism is the ability to perform multiple tasks simultaneously by slicing the data into smaller partitions and processing them in parallel across multiple nodes in a cluster. Apache Spark…
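A hedged sketch, runnable in spark-shell where `spark` is predefined (the partition count of 8 is arbitrary):

```scala
// Slice a dataset into 8 partitions so tasks can run in parallel.
val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
println(rdd.getNumPartitions)        // 8

// The same idea with the DataFrame API.
val df = spark.range(1000000).repartition(8)
println(df.rdd.getNumPartitions)     // 8
```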
A cluster manager is an external service through which Spark jobs are submitted and which acquires resources for them on the cluster. Spark applications are independent…
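To illustrate, the master URL tells Spark which cluster manager to acquire resources from; the host names below are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ClusterManagerDemo")
  // Pick one master URL depending on the cluster manager:
  //   "local[*]"               - no cluster manager, run in-process
  //   "spark://host:7077"      - Spark standalone master (placeholder host)
  //   "yarn"                   - Hadoop YARN
  //   "k8s://https://host:443" - Kubernetes (placeholder endpoint)
  .master("local[*]")
  .getOrCreate()
```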
Apache Spark is an open-source distributed framework, optimized for both in-memory and disk-based processing, which performs real-time analytics using Resilient Distributed Datasets (RDDs). It includes a streaming library, and a rich set of…
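A small sketch of the RDD abstraction, runnable in spark-shell where `spark` is predefined:

```scala
// Build an RDD, transform it in memory, and collect the result.
val words = spark.sparkContext.parallelize(Seq("spark", "rdd", "spark", "analytics"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
counts.collect().foreach(println)    // (spark,2), (rdd,1), (analytics,1)
```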
SparkSession has been the main entry point to Spark applications since Spark 2.0. Before Spark 2.0, SparkContext was the main entry point for any Spark application. We see how…
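The contrast, sketched in Scala (application names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

// Before Spark 2.0: SparkContext as the entry point.
val conf = new SparkConf().setAppName("LegacyApp").setMaster("local[*]")
val sc = new SparkContext(conf)

// Since Spark 2.0: SparkSession wraps SparkContext (and SQLContext).
// getOrCreate() reuses the SparkContext created above rather than
// starting a second one.
val spark = SparkSession.builder().appName("ModernApp").getOrCreate()
val sameSc = spark.sparkContext   // the underlying SparkContext is still there
```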
Caching and persistence are optimization techniques for iterative and interactive Apache Spark computations. They save interim partial results, which can be reused in subsequent stages. These results,…
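A short sketch, runnable in spark-shell where `spark` is predefined (the DataFrame is synthetic):

```scala
import org.apache.spark.storage.StorageLevel

val df = spark.range(1, 1000000).selectExpr("id", "id % 10 AS bucket")

df.cache()                           // MEMORY_AND_DISK by default for DataFrames
df.count()                           // first action materializes the cache
df.groupBy("bucket").count().show()  // subsequent stages reuse the cached partitions

val other = spark.range(100).persist(StorageLevel.MEMORY_ONLY) // explicit level
df.unpersist()                       // release storage when finished
```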
Data locality refers to how close data is to the code processing it. Having the code and the data together tends to make computations faster in Apache Spark. If the…
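One related knob is `spark.locality.wait`, which controls how long the scheduler holds out for a data-local slot. A hedged config sketch (the value shown is the default):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("LocalityDemo")
  .master("local[*]")
  // How long the scheduler waits for a data-local slot before
  // falling back to a less local level (node -> rack -> any).
  .config("spark.locality.wait", "3s")
  .getOrCreate()
```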
If you are switching from Hortonworks Data Platform (HDP) 2.6 to 3.0+, you will have a hard time accessing Hive tables through the Apache Spark shell. HDP 3 introduced something called…
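For reference, this is the classic route for reaching Hive tables from a Spark application; on HDP 3+ this route alone may no longer see managed Hive tables, which is the migration pain described above:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveAccess")
  .enableHiveSupport()    // wires Spark SQL to the Hive metastore
  .getOrCreate()

spark.sql("SHOW DATABASES").show()
```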
UDAF stands for User-Defined Aggregate Function. Aggregate functions perform a calculation on a set of values and return a single value. Writing an aggregate function is harder than writing a User-Defined Function (UDF), as we need to aggregate over multiple rows and columns. An Apache Spark UDAF operates on more than one row or column while returning a single-value result.
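As a sketch in the Spark 3 style, here is a user-defined average built on the `Aggregator` API (the registered name `my_average` is made up):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Intermediate buffer: running sum and count across rows.
case class AvgBuffer(sum: Double, count: Long)

object MyAverage extends Aggregator[Double, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, a: Double): AvgBuffer =
    AvgBuffer(b.sum + a, b.count + 1)                 // fold one row in
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer =
    AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)   // combine partitions
  def finish(b: AvgBuffer): Double = b.sum / b.count  // single-value result
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Register for SQL use, e.g. SELECT my_average(salary) FROM employees:
// spark.udf.register("my_average", udaf(MyAverage))
```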
When working with Apache Spark, there are times when you need to trigger a Spark job on demand from within or outside the cluster. There are two ways to submit an Apache Spark job to a cluster: a bash script and the REST API.
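For the scripted route, Spark also ships a programmatic wrapper around spark-submit, `org.apache.spark.launcher.SparkLauncher`. A hedged sketch, where the jar path and class name are placeholders:

```scala
import org.apache.spark.launcher.SparkLauncher

object SubmitOnDemand {
  def main(args: Array[String]): Unit = {
    // Equivalent to invoking spark-submit from a bash script.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")     // placeholder jar
      .setMainClass("com.example.MyApp")         // placeholder main class
      .setMaster("yarn")                         // or a standalone master URL
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .startApplication()

    // Block until the application reaches a terminal state.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Final state: ${handle.getState}")
  }
}
```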