Introduction to Apache Spark Streaming
The Apache Spark Streaming execution model has several advantages over traditional streaming systems: fast recovery from failures, dynamic load balancing, combined streaming and interactive analytics, and native integration with Spark's other libraries.
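Spark Streaming's execution model handles a live stream as a sequence of small batches. The plain-Python sketch below only illustrates that idea and does not use Spark itself; the names `micro_batches` and `word_count`, the sample events, and the one-second interval are all made up for illustration.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, line) events into fixed-width time batches.

    Simplification: intervals that receive no events are skipped here,
    whereas a real streaming engine would still emit an empty batch.
    """
    batches = defaultdict(list)
    for timestamp, line in events:
        # Events whose timestamps fall in the same interval share a batch.
        batches[int(timestamp // batch_interval)].append(line)
    # Return the batches in time order.
    return [batches[index] for index in sorted(batches)]


def word_count(batch):
    """Count the words in one batch of text lines."""
    counts = Counter()
    for line in batch:
        counts.update(line.split())
    return counts


# Hypothetical events: (arrival time in seconds, text line).
events = [
    (0.2, "spark streaming"),
    (0.7, "spark"),
    (1.1, "streaming rdd"),
    (2.5, "spark rdd rdd"),
]

# Each batch is processed independently, like a small batch job.
results = [word_count(batch) for batch in micro_batches(events, batch_interval=1.0)]
for index, counts in enumerate(results):
    print(index, dict(counts))
```

Because each batch is a small, self-contained unit of work, a failed batch can be recomputed on its own and batches can be scheduled on whichever workers are free, which is the intuition behind the fast recovery and load balancing mentioned above.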
Spark Resilient Distributed Dataset (RDD)
A Resilient Distributed Dataset (RDD) is the primary data abstraction in Apache Spark: a fault-tolerant, immutable, distributed collection of objects. The term ‘resilient’ in ‘Resilient Distributed Dataset’ refers…
What are Spark Shared Variables?
Shared variables are an abstraction in Apache Spark for sharing data across parallel operations that run on different nodes. When Spark runs a function in parallel as a set of tasks on…
Installing Apache Spark on Linux
Apache Spark is an open-source cluster-computing framework. This post explains the steps for installing a prebuilt version of Apache Spark 2.1.1 as a standalone cluster on a Linux system. I used Ubuntu, a Debian-based OS, for this post.
What is Apache Spark? The unified engine for large-scale data analytics.
Apache Spark is a distributed system, optimized for both in-memory and disk-based processing, that supports real-time analytics using Resilient Distributed Datasets (RDDs). Spark includes a streaming library and a rich set of programming interfaces that make data processing and transformation easier.