Introduction to Apache Spark Streaming
The Apache Spark Streaming execution model has several advantages over traditional streaming systems: fast recovery from failures, dynamic load balancing, combined streaming and interactive analytics, and native integration.
What is a Database and Why is it Important? Facts and Types
What is Data? Before we learn about databases, we need to first understand what data is. Data is information or facts related to an object that is under consideration. In…
Spark Resilient Distributed Dataset (RDD)
Resilient Distributed Dataset (RDD) is the fault-tolerant and immutable primary data structure/abstraction in Apache Spark. It is a distributed collection of objects. The term ‘resilient’ in ‘Resilient Distributed Dataset’ refers…
Joins using MapReduce Framework
There are three types of joins that can be used to join tables in MapReduce: reduce-side joins, map-side joins, and memory-backed joins. Map Side Join Joining at the map side…
What are Spark Shared Variables?
Shared variables are an abstraction in Apache Spark that is used in parallel operations across different nodes. When Spark runs a function in parallel as a set of tasks on…
Most Useful Apache Hadoop HDFS Commands
This post describes some of the most useful Apache Hadoop HDFS commands you would need when working in a Hadoop cluster.
Introduction to Hadoop MapReduce Framework
The Hadoop MapReduce framework is a big data processing framework that consists of the MapReduce programming model and the Hadoop Distributed File System.