Background
When I was working with Kafka, I did a lot of research into event-driven messaging and event-based architecture. During that research, I stumbled upon Apache NiFi, which helps create complex data flows for distributed or Internet of Things (IoT) based applications. I decided to write this introduction to NiFi because I expect it to be a key player in IoT-based applications in the future.
Introduction
Apache NiFi is an open-source tool for automating and managing the flow of data between systems (databases, sensors, Hadoop, data platforms, and other sources). It solves the problem of collecting and transporting data in real time from a multitude of data sources, and it provides an interactive user interface for controlling live flows, with full and automated data provenance.
NiFi is a data-source-agnostic framework. It supports disparate, distributed sources whose data differs in format, schema, protocol, speed, and size. Examples of such sources include:
- Machines
- Geolocation devices
- Clickstreams
- Files
- Social feeds
- Log files
- Videos
It is configurable plumbing for moving data around, similar to how FedEx, UPS, or other courier and delivery services move parcels. And just like those services, Apache NiFi lets you trace your data in real time, the same way you would track a delivery.
NiFi is written in Java, is based on the flow-based programming model, and provides a web-based user interface for managing data flows in real time. It provides the data acquisition, simple event processing, transport, and delivery mechanisms needed to accommodate the diverse data flows generated by a world of connected people, systems, and things.
The project was developed inside the United States National Security Agency (NSA) for eight years under the name NiagaraFiles. In 2014, the NSA made it open source through the Apache Software Foundation via its technology transfer program.
NiFi is built for creating dataflows: you can transfer data from one system to another and process the data in between.
Figure: Apache NiFi Architecture
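The idea of moving data from one system to another while processing it in between can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `DataFlowSketch`, `Item`, `run`, `upperCase`, and `tagLength` are hypothetical names made up for this post and are not part of the NiFi API. The `Item` class loosely mirrors the idea of a FlowFile as content plus key/value attributes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these types belong to the NiFi API.
final class DataFlowSketch {

    // Stand-in for one unit of data in the flow: content plus key/value attributes.
    static final class Item {
        final String content;
        final Map<String, String> attributes;

        Item(String content, Map<String, String> attributes) {
            this.content = content;
            this.attributes = Collections.unmodifiableMap(new HashMap<>(attributes));
        }

        @Override
        public String toString() {
            return content + " " + attributes;
        }
    }

    // Passes every item from the source through each step in order,
    // then collects the result at the destination.
    static List<Item> run(List<Item> source, List<UnaryOperator<Item>> steps) {
        List<Item> destination = new ArrayList<>();
        for (Item item : source) {
            Item current = item;
            for (UnaryOperator<Item> step : steps) {
                current = step.apply(current);
            }
            destination.add(current);
        }
        return destination;
    }

    // Example step: upper-case the content, keep the attributes.
    static Item upperCase(Item in) {
        return new Item(in.content.toUpperCase(Locale.ROOT), in.attributes);
    }

    // Example step: record the content length as a new attribute.
    static Item tagLength(Item in) {
        Map<String, String> attrs = new HashMap<>(in.attributes);
        attrs.put("length", String.valueOf(in.content.length()));
        return new Item(in.content, attrs);
    }

    public static void main(String[] args) {
        List<Item> source = List.of(new Item("sensor reading", Map.of("origin", "sensor-1")));
        List<UnaryOperator<Item>> steps = List.of(DataFlowSketch::upperCase, DataFlowSketch::tagLength);
        System.out.println(run(source, steps));
    }
}
```

In NiFi itself you would not write this loop; you would drag processors onto the canvas and connect them in the web UI. The sketch only shows the shape of the idea: source, a chain of processing steps, destination.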
NiFi Real-World Use Cases
NiFi is used for data ingestion: it pulls data from numerous data sources and creates FlowFiles from it. It can handle extremely large data sets, small data arriving at high rates, and variable-sized data. Some published use cases and demos are listed below.
- Schlumberger: From Zero to DataFlow in Hours with Apache NiFi
- Neustar: Make Streaming Analytics Work for You
- Comcast: Improving Customer Experience
- Capital One, as part of Apache Metron
- Hadoop Summit Keynote
- Connected Car: GENIVI
- Connected Car: TU Automotive
- Real-Time Retail: Customer Interaction with Facial Recognition, Live Voting, and Electronic Conversation
- Data Hacks and Demos keynote from Hadoop Summit San Jose
- Reading children’s books
- Voice-activated real-time stock quotes
- Apache NiFi and Facebook
- Apache NiFi and SFDC
- Apache NiFi and Raspberry Pi
- Apache NiFi and Sprinkler System
NiFi vs Kafka
Both Apache NiFi and Apache Kafka provide a broker to connect producers and consumers, but they do so in quite different ways that complement each other when you look holistically at what it takes to connect the enterprise.
With Kafka, the logic of the dataflow lives in the systems that produce data and the systems that consume it. NiFi decouples producer and consumer further and allows as much of the dataflow logic as possible, or desired, to live in the broker itself. This is why NiFi has interactive command and control to effect immediate change, and why it offers the processor API to operate on, alter, and route data streams as they flow. It is also why NiFi provides powerful back-pressure and congestion control features. NiFi's model gives you a point of central control with distributed execution, where you can address cross-cutting concerns such as compliance checks and tracking that you would not want to implement in every producer and consumer.
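To make the back-pressure idea more concrete, here is a minimal plain-Java sketch of a bounded queue sitting between a producer and a consumer. It is an illustration only: `BackPressureSketch`, `Connection`, `produce`, and the threshold value are hypothetical and not the NiFi API. In NiFi, back-pressure thresholds are configured on the connections between processors rather than written in code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of back-pressure; not NiFi code.
final class BackPressureSketch {

    // A queue between two stages that refuses new items once a threshold is reached.
    static final class Connection {
        private final Deque<String> queue = new ArrayDeque<>();
        private final int threshold;

        Connection(int threshold) {
            this.threshold = threshold;
        }

        // True when the queue is full and the upstream side should stop sending.
        boolean isBackPressured() {
            return queue.size() >= threshold;
        }

        // Accepts the item only if the threshold has not been reached.
        boolean offer(String item) {
            if (isBackPressured()) {
                return false;
            }
            queue.addLast(item);
            return true;
        }

        // Downstream side removes the oldest item, or gets null if the queue is empty.
        String poll() {
            return queue.pollFirst();
        }
    }

    // Producer tries to send n items and stops at the first refusal.
    // Returns how many items were accepted.
    static int produce(Connection connection, int n) {
        int accepted = 0;
        for (int i = 0; i < n; i++) {
            if (!connection.offer("item-" + i)) {
                break;
            }
            accepted++;
        }
        return accepted;
    }

    public static void main(String[] args) {
        Connection connection = new Connection(3);
        System.out.println("accepted: " + produce(connection, 5));
        System.out.println("back-pressured: " + connection.isBackPressured());
        connection.poll(); // the consumer catches up by one item
        System.out.println("back-pressured: " + connection.isBackPressured());
    }
}
```

The point of the design is that a slow consumer slows the producer down instead of letting the queue grow without bound, and that this policy lives in the middle of the flow rather than in the producer or consumer code.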
Push vs Pull Data Ingestion Pattern
With Kafka, producers push data to the Kafka broker and consumers pull data from it. Though this is a clean and scalable model, it requires every participating system to accept and follow that protocol. In contrast, NiFi does not impose one specific protocol. It supports both push and pull patterns for getting data into and out of NiFi.
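The difference between the two patterns comes down to who initiates the transfer. The following plain-Java sketch shows a stage that supports both directions on both sides. All names here (`IngestionSketch`, `receive`, `fetchFrom`, `deliverTo`, `take`) are hypothetical and chosen for illustration; they are not NiFi or Kafka APIs.

```java
import java.util.ArrayDeque;
import java.util.Optional;
import java.util.Queue;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch of push vs. pull; not NiFi or Kafka code.
final class IngestionSketch {

    private final Queue<String> buffer = new ArrayDeque<>();

    // Push-style ingestion: the source calls in whenever it has data.
    void receive(String data) {
        buffer.add(data);
    }

    // Pull-style ingestion: this stage asks the source for data.
    // Returns true if the source had something to give.
    boolean fetchFrom(Supplier<Optional<String>> source) {
        Optional<String> data = source.get();
        data.ifPresent(buffer::add);
        return data.isPresent();
    }

    // Push-style delivery: this stage hands every buffered item to the sink.
    // Returns the number of items delivered.
    int deliverTo(Consumer<String> sink) {
        int delivered = 0;
        String item;
        while ((item = buffer.poll()) != null) {
            sink.accept(item);
            delivered++;
        }
        return delivered;
    }

    // Pull-style delivery: the consumer asks for the next item, if any.
    Optional<String> take() {
        return Optional.ofNullable(buffer.poll());
    }

    public static void main(String[] args) {
        IngestionSketch stage = new IngestionSketch();
        stage.receive("pushed by the source");
        stage.fetchFrom(() -> Optional.of("pulled from the source"));
        System.out.println(stage.take().orElse("(empty)"));
        stage.deliverTo(System.out::println);
    }
}
```

In this sketch the Kafka model corresponds to using only `receive` on the way in and only `take` on the way out, while a stage that offers all four methods can adapt to whatever each external system supports.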
High Availability
On the data plane, NiFi does not offer distributed data durability today, as Kafka does. The NiFi community is adding distributed durability, but its value for NiFi's use cases will be less vital than it is for Kafka, because NiFi isn't holding the data for the arbitrary consumer pattern that Kafka supports. If a NiFi node goes down, the data on that node is delayed while it is down. Data loss, though, can be avoided with tried-and-true RAID or distributed block storage. NiFi's control plane already provides high availability: the cluster manager, and even multiple nodes in a cluster, can be lost while the live flow continues operating normally.
Performance
Kafka offers an impressive balance of both high throughput and low latency. But comparing the performance of Kafka and NiFi is not very meaningful, given that they do very different things. It would be best to discuss performance tradeoffs in the context of a particular use case.
Programming Language Supported by Apache NiFi
NiFi is implemented in the Java programming language and allows extensions (processors, controller services, and reporting tasks) to be implemented in Java. In addition, NiFi supports processors that execute scripts written in Groovy, Jython, and several other popular scripting languages.
Conclusion
In this blog post, we learned about Apache NiFi, its real-world use cases, and how it compares with Kafka.
Please share this blog post on social media and leave a comment with any questions or suggestions.
References:
Real World Use Cases of Real-Time DataFlows in Record Time – Hortonworks