What is a Data Lake? Features and Architecture

Introduction to Data Lake

A Data Lake is a centralized, data-centric storage architecture used for persisting a variety of data in its raw, unfiltered, and untransformed format. It serves as a single point of storage for all the raw and curated enterprise data sets. These data sets can then be used for reporting, visualization, analytics, and machine learning.

A Data Lake stores information arriving from multiple sources. The platform is a distributed, highly scalable data store that can scale horizontally as data needs change. If needed, we can set a Time-to-Live (TTL) on entries in the Data Lake so that data is purged after an appropriate time.
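The TTL idea can be illustrated with a minimal sketch. It assumes a hypothetical catalog of entries, each with an ingestion time and an optional TTL; the `LakeEntry` class, the paths, and the retention periods are made up for illustration and do not correspond to any specific Data Lake product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class LakeEntry:
    """Hypothetical catalog record for one object stored in the lake."""
    path: str
    ingested_at: datetime
    ttl: Optional[timedelta] = None  # None means "keep forever"


def is_expired(entry: LakeEntry, now: datetime) -> bool:
    """Return True when the entry has outlived its TTL."""
    if entry.ttl is None:
        return False
    return now >= entry.ingested_at + entry.ttl


def purge_expired(entries: List[LakeEntry], now: datetime) -> List[LakeEntry]:
    """Return only the entries that should be retained."""
    return [e for e in entries if not is_expired(e, now)]


# Illustrative data: two entries with a 7-day TTL, one without a TTL.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
entries = [
    LakeEntry("raw/weblogs/2024-01-01.log",
              datetime(2024, 1, 1, tzinfo=timezone.utc), timedelta(days=7)),
    LakeEntry("raw/crm/customers.json",
              datetime(2024, 1, 1, tzinfo=timezone.utc)),
    LakeEntry("raw/sales/2024-01-30.csv",
              datetime(2024, 1, 30, tzinfo=timezone.utc), timedelta(days=7)),
]
kept = purge_expired(entries, now)
print([e.path for e in kept])
```

In a real deployment the purge is usually delegated to the storage layer's own retention or lifecycle settings rather than application code; the sketch only shows the decision rule.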

Architecture of Data Lake

A Data Lake can store a variety of data, such as web server logs, databases, social media, Customer Relationship Management (CRM) data, inventory data, sales transaction data, and third-party data, ingested in batch or in real time. We can apply different kinds of transformations to the data to create Business Ready Datasets (BRD) when needed. This type of architecture is known as “Schema on Read”: the schema is applied only when we read or use the data for a certain purpose. This is the opposite of the “Schema on Write” approach used in traditional storage systems like a Data Warehouse, where the schema is enforced when the data is stored.
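A small sketch may make “Schema on Read” more concrete. It assumes raw events are kept as JSON lines exactly as they arrived, and that the reader decides which fields and types matter at query time. The sample events, the `SALES_SCHEMA` mapping, and the `read_with_schema` helper are hypothetical names used only for illustration.

```python
import json

# Raw events stored as-is; records do not need to share the same fields.
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99", "ts": "2024-01-05", "channel": "web"}',
    '{"user": "bob", "amount": "5", "ts": "2024-01-06"}',
    '{"user": "carol", "ts": "2024-01-07", "note": "no amount recorded"}',
]

# Schema chosen by the reader: field name -> (converter, default value).
SALES_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project them onto the schema at read time."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default instead of failing.
            row[field] = convert(value) if value is not None else default
        yield row


rows = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
for row in rows:
    print(row)
```

A different consumer could read the same raw events with a different schema (for example, only `ts` and `channel`) without rewriting what is stored. Under “Schema on Write”, the third record would typically be rejected or reshaped before it was ever saved.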

Features of Data Lake

It is a cost-effective storage platform that captures and processes vast amounts of multi-structured data. Below are some important features of a Data Lake.

Cross-platform environment
Geographically distributed across many data centers
Highly scalable, with low latency and high fault tolerance
Simultaneous read/write access to the data

Types of Data in Data Lake

Structured Data

This data is organized in rows and columns. It can be either a table in a Hive warehouse or a Business Ready Dataset (BRD) built from curated data sets.

Semi-Structured Data

These are flat files (CSV, TSV) as well as XML and JSON files.

Unstructured Data

These are emails, documents, and PDF files with extensions such as .msg, .docx, .xls, and .pdf.

Binary Data

These are images, audio, and video files coming from a variety of sources.
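As a rough illustration of these categories, the sketch below tags incoming files by extension. The semi-structured and unstructured extensions come from the lists above; the image, audio, and video extensions are assumptions added for the example, and the `classify` helper is hypothetical. Structured data usually lives in tables rather than loose files, so it is not covered by this extension-based lookup.

```python
from pathlib import PurePosixPath

# Extension -> category. Binary extensions are illustrative assumptions.
CATEGORY_BY_EXTENSION = {
    ".csv": "semi-structured",
    ".tsv": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".msg": "unstructured",
    ".docx": "unstructured",
    ".xls": "unstructured",
    ".pdf": "unstructured",
    ".png": "binary",
    ".jpg": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def classify(path: str) -> str:
    """Return the data category for a file path, or 'unknown'."""
    extension = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(extension, "unknown")


for sample in ["landing/orders.CSV", "mail/invoice.msg", "media/clip.mp4", "misc/readme"]:
    print(sample, "->", classify(sample))
```

A tag like this is only a first guess; real ingestion pipelines often inspect file contents as well, since an extension can be missing or misleading.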

Conclusion

In this blog post, we learned about the Data Lake, its architecture, its features, and the types of data it stores.

Please share this blog post on social media and leave a comment with any questions or suggestions.
