Data Ingestion, or Data Processing, is a framework or process in which data from multiple sources is captured into a storage layer where it can be accessed, used, and analyzed by an organization. Because data ingestion is the backbone of any modern data-driven architecture, all downstream systems rely on a successful ingestion process to receive correct data. The framework keeps the storage layer consistent with the changes that happen in the source systems. It is one of the day-to-day tasks that every data engineer performs.
Destination of Data Ingestion Process
There can be multiple destinations for the data ingestion process, and the destination varies with the organization and the requirements of the ingestion task. Some common destinations are given below.
- Data Warehouse
- Data Mart
- Data Lake
- SQL (Structured Query Language) Database
- NoSQL (Non-Relational) Database
Source of Data Ingestion Process
Data ingestion sources vary depending upon the nature of the organization and the type of processing it does. Below are some common data ingestion sources.
- Relational (SQL) and Non-Relational (NoSQL) Databases
- Legacy Systems
This is data generated by legacy systems such as CRM (Customer Relationship Management) systems and mainframe-based applications.
- Sensor and IoT (Internet of Things) Devices
This is data generated by smart devices such as cell phones, healthcare equipment, smart-home appliances, self-driving cars, weather sensors, and smartwatches, which produce huge amounts of data every day.
- Data Generated by Social Media Platforms
Social media platforms such as Instagram, YouTube, Facebook, Twitter, LinkedIn, and other online sites generate large amounts of data every day. The free text, images, and videos they contain can be used to analyze customer behavior.
- In-House Applications
Many organizations have their own in-house applications that generate large amounts of data every day, which is stored in either relational or non-relational databases.
Types of Data Ingestion
The nature of the business and its requirements define the characteristics of an organization's data ingestion layer. With the right ingestion model, business requirements can be met on time, giving the business an edge over its competitors. Based on how data is moved, the data ingestion process can be classified into two types: batch and streaming.
Batch Ingestion
In this ingestion type, source data is periodically collected, grouped into batches, and sent to the target system. Batch processing is generally the more feasible and affordable option, and in many organizations it runs once per day.
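As a rough illustration, the sketch below shows a daily batch job that loads new CSV files from a landing folder into a SQLite table. The folder, file layout, and column names are hypothetical; a real job would point at your own landing area and target store.

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical locations; adjust to your environment.
LANDING_DIR = Path("landing/orders")   # where source extracts arrive each day
DB_PATH = "warehouse.db"               # simple stand-in for the target storage layer

def ingest_daily_batch() -> int:
    """Load every CSV in the landing folder into the target table."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL, order_date TEXT)"
    )
    rows_loaded = 0
    for csv_file in LANDING_DIR.glob("*.csv"):
        with csv_file.open(newline="") as f:
            for row in csv.DictReader(f):
                conn.execute(
                    "INSERT INTO orders VALUES (?, ?, ?)",
                    (row["order_id"], row["amount"], row["order_date"]),
                )
                rows_loaded += 1
    conn.commit()
    conn.close()
    return rows_loaded

if __name__ == "__main__":
    print(f"Loaded {ingest_daily_batch()} rows")
```

In production, a scheduler (cron, Airflow, etc.) would typically trigger a job like this once per day.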
Streaming or Real-time Ingestion
In this ingestion type, changes are applied to the target as they are received. Because there is some lag between the moment an event occurs at the source and the moment the application processes it, this is also known as near real-time ingestion. Many organizations use a process known as change data capture (CDC) to track which values changed in certain columns; when this is done in real time, it is called event-driven CDC. Many vendors provide software with this capability for organizations that use their systems.
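To make the idea concrete, here is a minimal sketch that applies change events to a target store as they arrive. The event format and the in-memory stream are hypothetical stand-ins for what a real CDC tool would emit.

```python
from typing import Dict, Iterable

# Hypothetical change events, similar in spirit to what a CDC feed might emit:
# each event names the operation, the record key, and the changed column values.
def change_event_stream() -> Iterable[Dict]:
    yield {"op": "insert", "key": 101, "values": {"status": "NEW"}}
    yield {"op": "update", "key": 101, "values": {"status": "SHIPPED"}}
    yield {"op": "delete", "key": 101, "values": {}}

def apply_changes(target: Dict[int, Dict], events: Iterable[Dict]) -> None:
    """Apply each change to the target store as soon as it arrives."""
    for event in events:
        if event["op"] == "delete":
            target.pop(event["key"], None)
        else:  # insert or update
            target.setdefault(event["key"], {}).update(event["values"])

store: Dict[int, Dict] = {}
apply_changes(store, change_event_stream())
print(store)  # {} -- the final delete removed key 101
```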
Common Requirements in a Data Ingestion / Data Processing Framework
Many requirements drive the data ingestion process, and they vary with the organization and the type of data being ingested. Still, there is a common set of metadata that every organization wants to capture about the data it ingests. Some of these are listed below, followed by a small sketch of how they might be recorded.
- Periodic Data Quality Measurement
- Owner of the data being ingested
- Security classification (private, public, sensitive, highly sensitive) of the incoming data
- The business definition of the data being ingested
- Source type of the ingested data (file, XML, JSON, etc.)
- Business rules applied to the ingested data
- Lineage of the ingested data
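One way to capture this metadata is with a simple record per ingested dataset. The field names and example values below are purely illustrative.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical record of the metadata listed above; field names are illustrative.
@dataclass
class IngestionMetadata:
    dataset_name: str
    data_owner: str
    security_classification: str   # e.g. "public", "private", "sensitive", "highly sensitive"
    business_definition: str
    source_type: str               # e.g. "file", "XML", "JSON"
    business_rules: List[str] = field(default_factory=list)
    lineage: List[str] = field(default_factory=list)   # upstream systems or tables
    quality_score: float = 0.0     # result of periodic data quality measurement

orders_feed = IngestionMetadata(
    dataset_name="orders",
    data_owner="sales-data-team",
    security_classification="sensitive",
    business_definition="One row per customer order placed on the website",
    source_type="JSON",
    business_rules=["reject rows with a negative amount"],
    lineage=["crm.orders", "web.checkout_events"],
)
```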
Data Ingestion Software
There are many tools, both open source and commercial, that can perform data ingestion. The choice of tool depends on the organization's requirements, the size and complexity of the data, the budget, the programming languages in use, and the skill set of the people in the organization. Some organizations lean toward Java, whereas others lean toward Python.
Here are some popular data ingestion tools; a short PySpark sketch follows the lists below.
Open Source Software
- Apache Spark
- Apache Kafka
- Apache NiFi
- Apache Sqoop
- Apache Flume
- Apache Samza
- Apache Storm
- Fluentd
- Apache Gobblin
Commercial Software
- DataTorrent
- Amazon Kinesis
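As an example of one of these tools in use, here is a minimal PySpark batch read-and-write sketch. It assumes Spark is available, and the bucket paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("simple-ingestion").getOrCreate()

# Read raw source files (hypothetical path) and write them out as a Parquet
# dataset that a warehouse or lake engine can query.
raw = spark.read.option("header", "true").csv("s3://example-bucket/raw/orders/")
raw.write.mode("append").parquet("s3://example-bucket/lake/orders/")

spark.stop()
```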
Conclusion
We covered the meaning of data ingestion, the types of data ingestion, and the sources of data for ingestion. We also looked at the different software tools that can be used in the data ingestion process.
Please share this blog post on social media and leave a comment with any questions or suggestions.