Introduction to Apache Flume: Components and Channels

Apache Flume is an open source Apache project for moving large quantities of streaming data into HDFS. It collects log data from web server log files and aggregates it in HDFS for analysis.

It supports multiple sources such as 'tail' (of a file), system logs, Apache access logs, and Apache log4j. Furthermore, it provides end-to-end reliability because of its transactional approach to data flow.
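
As a quick illustration, a 'tail'-style source and a syslog source might be declared in a Flume agent's properties file roughly as follows; the agent name, source names, port, and log path are placeholders chosen for this sketch, not values from any particular deployment.

# 'tail' of a web server log, implemented with the exec source type
agent1.sources.access-log.type = exec
agent1.sources.access-log.command = tail -F /var/log/apache2/access.log

# System logs received over TCP with the syslog source type
agent1.sources.syslog-in.type = syslogtcp
agent1.sources.syslog-in.host = 0.0.0.0
agent1.sources.syslog-in.port = 5140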

Flume Core Components

The core components of Flume are the Event, the Source, the Channel, and the Sink, all of which run inside a Flume Agent (a JVM process). An event is the unit of data being transported. A source consumes events handed to it (for example, web server log records) and places them on one or more channels, a channel buffers those events, and a sink removes events from a channel and delivers them to their destination, such as HDFS.
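
To show how these components fit together, here is a minimal, illustrative agent configuration in Flume's properties-file format, wiring an exec source, a memory channel, and an HDFS sink; the agent name, component names, file path, and HDFS URL are assumptions made for the sketch.

# Name the source, channel, and sink of this agent
agent1.sources = weblog-source
agent1.channels = mem-channel
agent1.sinks = hdfs-sink

# Source: tail a web server log file
agent1.sources.weblog-source.type = exec
agent1.sources.weblog-source.command = tail -F /var/log/apache2/access.log
agent1.sources.weblog-source.channels = mem-channel

# Channel: buffer events in memory between source and sink
agent1.channels.mem-channel.type = memory
agent1.channels.mem-channel.capacity = 10000

# Sink: write the buffered events into HDFS
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-channel
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream

An agent configured this way is normally started with the flume-ng command, for example: flume-ng agent --conf conf --conf-file weblog-agent.conf --name agent1 (the configuration file name here is again only a placeholder).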

Channels in Apache Flume

Apache Flume has three different built-in channels for buffering events between sources and sinks: the memory channel, the JDBC channel, and the file channel.

In the memory channel, events read from the source are held in memory until they are passed on to the sink.
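
A memory channel needs only a type plus optional capacity settings, as in the agent sketch above; the names and numbers below are illustrative.

agent1.channels.mem-channel.type = memory
# Maximum number of events the channel can hold at one time
agent1.channels.mem-channel.capacity = 10000
# Maximum number of events handled in a single transaction
agent1.channels.mem-channel.transactionCapacity = 1000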

The JDBC channel stores the events in an embedded Derby database.
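
Declaring a JDBC channel is similarly short, since the embedded Derby database is its default backing store; the channel name is again just a placeholder.

agent1.channels.jdbc-channel.type = jdbc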

The file channel writes the contents of each event to a file on the local file system after reading the event from a source. The data is deleted only after the contents have been successfully delivered to the sink.
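
A file channel is usually given explicit checkpoint and data directories on the local file system; the directories below are placeholders for this sketch.

agent1.channels.file-channel.type = file
# Directory where the channel stores its checkpoint metadata
agent1.channels.file-channel.checkpointDir = /var/flume/checkpoint
# One or more directories where event data is written
agent1.channels.file-channel.dataDirs = /var/flume/data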

Among these channels, the memory channel is the fastest, whereas the file channel is the most reliable. The caveat of the memory channel is the risk of data loss if the agent process or machine fails, since events exist only in memory; the file channel avoids this loss by persisting events to disk. Different organizations choose different channels according to their use case.

Conclusion

In this blog post, we learned about Apache Flume, its core components, and its channels.

Please share this blog post on social media and leave a comment with any questions or suggestions.