Aquileo | How to Create AWS Data Lake

Organizations today generate massive volumes of structured, semi-structured, and unstructured data that require scalable and flexible storage solutions. An AWS Data Lake enables centralized ingestion, storage, cataloging, and querying of data at scale while maintaining security and compliance.
It allows businesses to store raw data first and analyze it later for dashboards, real-time analytics, and machine learning.

Built using services like Amazon S3, AWS Glue, AWS Lake Formation, Amazon Athena, and IAM.
Stores data in its original format without requiring predefined structure.
Supports diverse analytics workloads including BI, real-time processing, and ML.
Provides elastic, secure, and cost-effective architecture for data-driven decision making.

Data Lake Architecture

The following diagram illustrates the AWS Data Lake architecture; its components are described in the sections below:

Data Source: The first layer of the architecture is the data source, where the data journey begins. The data ingestion layer gathers data from these sources, which can include IoT devices, social platforms, databases, cloud applications, and wearable smart devices. Data from these sources falls into three types:

1. Structured Data: Structured data is the most organized form of data. Examples: databases and Excel spreadsheets.

2. Semi-structured Data: Semi-structured data has some organization, but less than structured data, and it does not fit neatly into tables. Examples: HTML, CSV, JSON, and XML.

3. Unstructured Data: Unstructured data is not organized and has no predefined format. Examples: images, videos, sensor data, and audio recordings.

Data Ingestion: The data ingestion layer, also referred to as the “raw data layer”, gathers raw data from multiple data sources and ingests it into the data lake. It acts as the first checkpoint where data enters the lake, in either batch mode or real-time mode, before any further processing.
Batch Mode: In batch mode, data is collected and ingested in batches (groups) at scheduled intervals.
Real-time Mode: In real-time mode, data is ingested into the lake as soon as it is generated. Real-time ingestion is also called stream processing.
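
The difference between the two ingestion modes can be sketched in a few lines of plain Python. This is only an illustration, not AWS code: the function names `ingest_batch` and `ingest_stream`, the record shape, and the `sink` callback are all hypothetical and stand in for whatever service actually lands the data in the lake.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def ingest_batch(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Batch mode: collect records into groups and hand over one group at a time."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly smaller, batch
        yield batch


def ingest_stream(records: Iterable[Record], sink: Callable[[Record], None]) -> int:
    """Real-time mode: forward every record to the sink as soon as it arrives."""
    count = 0
    for record in records:
        sink(record)
        count += 1
    return count


if __name__ == "__main__":
    events = [{"sensor": "s1", "value": v} for v in range(5)]
    for group in ingest_batch(events, batch_size=2):
        print("ingested a batch of", len(group), "records")
    landed: List[Record] = []
    print("streamed", ingest_stream(events, landed.append), "records one by one")
```

Batch mode trades latency for fewer, larger writes, while real-time mode makes each record available immediately at the cost of many small writes.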

Data Storage & Processing

At this layer, the raw data collected from the various data sources is stored in its original format, whether structured, semi-structured, or unstructured. After the raw data is stored, transformation takes place at the same layer, because the data must be transformed and cleaned before further analysis.

This process includes operations such as cleaning, modification, and normalization:

Cleaning: removing and correcting errors and inconsistencies in the data.
Modification: adding supplementary information to the data.
Normalization: arranging the data in a consistent form that is easier to understand.

After transformation, the data is clean and organized according to requirements, and is referred to as “trusted data”, “true data”, or “processed data”. It is more reliable and better suited to further analysis or machine learning models.
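
As a rough illustration of turning raw records into trusted data, the sketch below applies cleaning, modification, and normalization in turn. The field names (`country`, `amount`, `source`) and the sample values are assumptions made up for this example; a real pipeline would typically run as an AWS Glue job rather than plain Python.

```python
from typing import Dict, List, Optional

RawRecord = Dict[str, Optional[str]]


def clean(records: List[RawRecord]) -> List[Dict[str, str]]:
    """Cleaning: drop records without an amount and strip stray whitespace."""
    cleaned = []
    for record in records:
        amount = (record.get("amount") or "").strip()
        if not amount:
            continue  # an unusable record is removed rather than guessed at
        country = (record.get("country") or "").strip()
        cleaned.append({"country": country, "amount": amount})
    return cleaned


def enrich(records: List[Dict[str, str]], source: str) -> List[Dict[str, str]]:
    """Modification: add supplementary information, here the originating file."""
    return [{**record, "source": source} for record in records]


def normalize(records: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Normalization: one consistent representation per field."""
    return [
        {**record, "country": record["country"].upper(), "amount": float(record["amount"])}
        for record in records
    ]


if __name__ == "__main__":
    raw = [
        {"country": " in ", "amount": " 10.5 "},
        {"country": "us", "amount": None},
        {"country": "Us", "amount": "3"},
    ]
    for row in normalize(enrich(clean(raw), "sales.csv")):
        print(row)
```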

Analytical Sandboxes: The analytical sandbox layer is a secured testing environment. It provides a dedicated space within the data lake architecture where data scientists, analysts, and researchers can test, experiment with, and explore the data and derive insights without compromising its integrity or quality. Both transformed data and raw data are imported into the analytical sandboxes.
Data Consumer: Next in the architecture is the data consumer layer. At this layer, the data is accessible and available once all preceding processes have been completed. This is where the data is ready for analysis and insight generation.

End users, business analysts, and data scientists access the data stored in the data lake to perform various tasks, such as:

Data Analysis
Machine learning models
Data visualization and reporting
Artificial Intelligence
Research
Decision Making

Data Governance, Security, And Monitoring

The overall data flow in a data lake depends on an oversight layer of governance, security, and monitoring. This layer cannot be bought off the shelf; it is normally implemented through a combination of configurations, third-party tools, and specialized teams.

Data Governance: Governance ensures that rules are in place to control how data is handled, as well as its quality and usability. This promotes consistent information and responsible usage. For instance, Apache Atlas or Collibra can be used to add a governance layer with more robust policy management and metadata tagging (information about the stored data).
Data Security: Security protocols prevent unauthorized access to the data and help enforce compliance with data protection laws. Tools such as Varonis or McAfee Total Protection for Data Loss Prevention can be integrated into your data lake to strengthen this aspect.
Monitoring: Monitoring oversees the ELT (Extract, Load, Transform) processes that turn raw data into more usable formats. Tools like Talend or Apache NiFi can help streamline these processes so that performance holds up even under heavy workloads.

How To Create an AWS Data Lake: A Step-By-Step Guide

We are going to create an AWS Data Lake using a combination of AWS services. The services we will use are:

AWS Glue: for running ETL and processing jobs and for cataloging the data.

AWS Lake Formation: for access control over the data.

Amazon Athena: for querying and analyzing the data in the Amazon S3 bucket.

Amazon S3: for storing the data.

Step 1: Create IAM User

First, we need to create an IAM user for controlled access to AWS services. We are going to create an IAM user named "Amazon-sales-user" for our dataset.
Search for IAM (Identity and Access Management) in the AWS Console search bar and navigate to IAM.

Creating IAM User

Click on the "Users" option in the menu, then click on the "Create user" button.

Creating User

Enter the user name in the user name box and click on "Next".

Define User Details

Now we have to give permissions to the user. Select the "Attach policies directly" option to set the permissions.
Search for and select the following policies:
1. AmazonS3FullAccess
2. AmazonAthenaFullAccess
3. AWSCloudFormationReadOnlyAccess
4. AWSGlueConsoleFullAccess
5. CloudWatchLogsReadOnlyAccess
After selecting the permission policies, click on "Next", review the user details, and click the "Create user" button.

Permissions Summary

The following screenshot shows the successful creation of the user.

Successful Creation Of User
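
For readers who prefer scripting over the console, the same user could be created roughly as follows. This is a sketch under assumptions: it expects a boto3 IAM client to be passed in (boto3 and valid AWS credentials are not part of the original walkthrough), and the helper names are made up for illustration. The policy names are the five selected above.

```python
from typing import Any, Iterable, List

# The AWS managed policies attached in the console steps above.
MANAGED_POLICY_NAMES = [
    "AmazonS3FullAccess",
    "AmazonAthenaFullAccess",
    "AWSCloudFormationReadOnlyAccess",
    "AWSGlueConsoleFullAccess",
    "CloudWatchLogsReadOnlyAccess",
]


def managed_policy_arns(names: Iterable[str]) -> List[str]:
    """AWS managed policies live under the shared 'aws' account namespace."""
    return [f"arn:aws:iam::aws:policy/{name}" for name in names]


def create_data_lake_user(iam: Any, user_name: str) -> None:
    """Create the user, then attach each managed policy directly to it.

    `iam` is assumed to be a boto3 IAM client, e.g. boto3.client("iam").
    """
    iam.create_user(UserName=user_name)
    for arn in managed_policy_arns(MANAGED_POLICY_NAMES):
        iam.attach_user_policy(UserName=user_name, PolicyArn=arn)


# Usage (needs boto3 and AWS credentials, so it is not executed here):
#   import boto3
#   create_data_lake_user(boto3.client("iam"), "Amazon-sales-user")
```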

Step 2: Create IAM Role

After creating the IAM user, we have to create an IAM role that will be used to catalog the data stored in the Amazon S3 bucket for our data lake.
Navigate to the IAM console again. Click on the "Roles" option in the left-hand menu, then click on the "Create role" button.

Creating IAM Role

Next, select the "AWS service" option, type "Glue" in the "Use case or service" box, and click on the "Next" button.

Selecting AWS Services

Now we have to add permissions. Search for the "PowerUserAccess" policy, select it, and click on the "Next" button.

Adding Permissions To Role

On the next screen, enter a role name of your choice, then scroll down and click on the "Create role" button.

Review and Create Role

Our IAM role is now successfully created.
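
Choosing "Glue" as the service in the console amounts to giving the role a trust policy that lets AWS Glue assume it. The sketch below shows what that looks like when scripted. It assumes a boto3 IAM client is passed in, and the role name in the usage comment is a placeholder. PowerUserAccess is attached only because the walkthrough uses it; it is a very broad policy.

```python
import json
from typing import Any, Dict

POWER_USER_ARN = "arn:aws:iam::aws:policy/PowerUserAccess"


def glue_trust_policy() -> Dict[str, Any]:
    """Trust policy that allows the AWS Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(iam: Any, role_name: str) -> None:
    """Create the role with the Glue trust policy and attach PowerUserAccess.

    `iam` is assumed to be a boto3 IAM client.
    """
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=POWER_USER_ARN)


# Usage (placeholder role name; needs boto3 and AWS credentials):
#   import boto3
#   create_glue_role(boto3.client("iam"), "data-lake-glue-role")
```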

Step 3: Create S3 Bucket to Store the Data

We have created the IAM user and IAM role for our AWS Data Lake. Now we need an Amazon S3 bucket to store our data. In this demonstration we upload the data to S3 manually.
Search for Amazon S3 in the AWS Management Console search bar and navigate to the S3 console.

Creating S3 Bucket

Click on the "Create bucket" button and enter a bucket name of your choice.

Assigning a Name to the Bucket

For default encryption, choose server-side encryption and set the Bucket Key to "Disable", then click on "Create bucket".

Choosing bucket configuration

The following screenshot shows that the bucket was created successfully.

Select your bucket to open it, then click on the "Upload" button to upload the data file. Click on "Add files", choose your data file, and click on "Upload".

Uploading file in bucket

Once the upload completes, as shown in the figure, our data is ready!
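
The bucket creation, encryption setting, and upload can also be scripted. The following is a minimal sketch that assumes a boto3 S3 client is passed in. The bucket name, region, and file names in the usage comment are placeholders, and SSE-S3 ("AES256") is an assumption, since the walkthrough only says "server-side encryption".

```python
from typing import Any, Dict


def create_bucket_request(bucket: str, region: str) -> Dict[str, Any]:
    """Build the arguments for create_bucket.

    us-east-1 is the default region and must not be sent as a LocationConstraint.
    """
    request: Dict[str, Any] = {"Bucket": bucket}
    if region != "us-east-1":
        request["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return request


def encryption_request(bucket: str) -> Dict[str, Any]:
    """Default server-side encryption (assumed SSE-S3) with the Bucket Key disabled."""
    return {
        "Bucket": bucket,
        "ServerSideEncryptionConfiguration": {
            "Rules": [
                {
                    "ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"},
                    "BucketKeyEnabled": False,
                }
            ]
        },
    }


def create_bucket_and_upload(s3: Any, bucket: str, region: str, local_path: str, key: str) -> None:
    """`s3` is assumed to be a boto3 S3 client created for the same region."""
    s3.create_bucket(**create_bucket_request(bucket, region))
    s3.put_bucket_encryption(**encryption_request(bucket))
    s3.upload_file(local_path, bucket, key)


# Usage (placeholder names; needs boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3", region_name="ap-south-1")
#   create_bucket_and_upload(s3, "my-data-lake-bucket", "ap-south-1", "sales.csv", "sales.csv")
```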

Step 4: Data Lake Set Up using AWS Lake Formation

Our data is ready to be ingested into the data lake, so we can now begin to set up the lake and create a database in it. Search for and navigate to the AWS Lake Formation console.
First, add an administrator to perform the administrative tasks of the data lake. Click on the "Add administrators" button (this window pops up only if you are working with AWS Lake Formation for the first time).
With the administrator added, it is time to create a database. In the left-hand menu, click on "Databases", then click on the "Create database" button.

Data Lake Setup Using AWS Lake Formation

Enter a database name of your choice. Then, in the "Location" box, browse to and provide the S3 bucket path where your data is stored.

Creating Database

Make sure to uncheck the "Use only IAM access control for new tables in this database" checkbox, then click on the "Create database" button. Your database is created right away.

Defining Database Details

With the database created, we have to register our S3 bucket as storage for the data lake. Click on the "Data lake locations" option in the left-hand menu, click on "Register location", and browse to and enter the S3 bucket path where the data is stored. Then keep the default IAM role, "AWSServiceRoleForLakeFormationDataAccess", and click on "Register location".

Registering Location
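
The database and the registered location can be created programmatically as well. The sketch below assumes boto3 Glue and Lake Formation clients are passed in; Lake Formation databases are stored in the AWS Glue Data Catalog, which is why the database is created through the Glue client. The names in the usage comment are placeholders, and the "Use only IAM access control" checkbox is not reproduced here.

```python
from typing import Any, Dict


def s3_arn(bucket: str, prefix: str = "") -> str:
    """ARN of a bucket, or of a prefix inside it, as used when registering a location."""
    return f"arn:aws:s3:::{bucket}/{prefix}" if prefix else f"arn:aws:s3:::{bucket}"


def database_request(name: str, bucket: str, prefix: str = "") -> Dict[str, Any]:
    """Arguments for creating a catalog database that points at the S3 location."""
    return {"DatabaseInput": {"Name": name, "LocationUri": f"s3://{bucket}/{prefix}"}}


def set_up_lake(glue: Any, lakeformation: Any, database: str, bucket: str) -> None:
    """`glue` and `lakeformation` are assumed to be boto3 clients."""
    glue.create_database(**database_request(database, bucket))
    # UseServiceLinkedRole=True corresponds to keeping the default
    # AWSServiceRoleForLakeFormationDataAccess role in the console.
    lakeformation.register_resource(ResourceArn=s3_arn(bucket), UseServiceLinkedRole=True)


# Usage (placeholder names; needs boto3 and AWS credentials):
#   import boto3
#   set_up_lake(boto3.client("glue"), boto3.client("lakeformation"),
#               "my-data-lake-db", "my-data-lake-bucket")
```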

Step 5: Data Cataloging using AWS Glue Crawlers

When building a data lake, it is essential that the data in it is cataloged. AWS Glue makes data cataloging easy.
AWS Glue provides an ETL (Extract, Transform, Load) service: it transforms, cleanses, and organizes data coming from multiple data sources before loading it into the data lake. AWS Glue makes data preparation efficient by automating ETL jobs.
AWS Glue also offers crawlers, which automate the cataloging process so that big data is easier to discover, search, and query.
To create a data catalog in the database, the AWS Glue crawler will use the IAM role we created in Step 2.
Go back to the AWS Lake Formation console and click on the "Databases" option, where you will see your previously created database. Select the database, open the "Action" drop-down menu, and click on the "Grant" option.

Granting permission to DB

In the next window, choose your previously created IAM role under "IAM users and roles". Scroll down to the "Database permissions" field, check only the "Create table" and "Alter" boxes, and click on the "Grant" button.

Grant data lake permissions

Define Database permissions
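
The grant made in the console corresponds to a single Lake Formation API call. Here is a minimal sketch of building its arguments, assuming a boto3 Lake Formation client for the actual call. The account ID, role name, and database name in the usage comment are placeholders.

```python
from typing import Any, Dict


def role_arn(account_id: str, role_name: str) -> str:
    """ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def grant_request(principal_arn: str, database: str) -> Dict[str, Any]:
    """Grant only 'Create table' and 'Alter' on the database to the principal."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["CREATE_TABLE", "ALTER"],
    }


# Usage (placeholder values; needs boto3 and AWS credentials):
#   import boto3
#   lakeformation = boto3.client("lakeformation")
#   lakeformation.grant_permissions(
#       **grant_request(role_arn("123456789012", "data-lake-glue-role"), "my-data-lake-db")
#   )
```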

After that, navigate to the AWS Glue console. In the left-hand menu, under "Data Catalog", click on "Crawlers", then click on the "Create crawler" button. Enter a name of your choice for the crawler, optionally add a description, and click on "Next".

Creating Crawler

Set the crawler properties as shown in the screenshot below.

Setting crawler properties

After clicking "Next", the "Choose data sources and classifiers" window opens, where we choose the data source to be crawled. For the S3 path, browse to and provide the S3 bucket path where our data is stored, then click on "Add an S3 data source". With the data source added, click on "Next".

Choosing data sources and classifiers

Add the data source and the location of the S3 data as shown in the screenshot below.

Add data source

On the next screen, we need to add the IAM role. Choose the previously created IAM role from the drop-down list and click on "Next".

Configuring security settings

Under "Set output and scheduling", choose the database we created, select "On demand" as the frequency for the crawler schedule, and click on "Next".

Setting output and scheduling

Finally, review the AWS Glue crawler configuration and click on the "Create crawler" button to save and create the crawler. The crawler is now ready! Because it is an on-demand crawler, run it from the console. It may take a short while to finish crawling the S3 bucket, after which you will see the tables that the crawler created automatically in the database.
Navigate to the AWS Lake Formation console and click on "Tables" in the menu to confirm there as well that the table has been created.
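
The crawler configuration above maps onto a handful of AWS Glue API calls. The sketch below assumes a boto3 Glue client is passed in, and all names in the usage comment are placeholders. Leaving out a schedule is what makes the crawler on demand.

```python
import time
from typing import Any, Dict


def crawler_request(name: str, role: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Arguments for an on-demand crawler over a single S3 path."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # No "Schedule" key: the crawler only runs when started explicitly.
    }


def create_and_run_crawler(glue: Any, request: Dict[str, Any]) -> None:
    """`glue` is assumed to be a boto3 Glue client."""
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


def wait_until_ready(glue: Any, name: str, poll_seconds: float = 5.0, max_polls: int = 60) -> bool:
    """Poll until the crawler returns to the READY state; False if it never does."""
    for _ in range(max_polls):
        if glue.get_crawler(Name=name)["Crawler"]["State"] == "READY":
            return True
        time.sleep(poll_seconds)
    return False


# Usage (placeholder names; needs boto3 and AWS credentials):
#   import boto3
#   glue = boto3.client("glue")
#   request = crawler_request("my-crawler", "data-lake-glue-role",
#                             "my-data-lake-db", "s3://my-data-lake-bucket/")
#   create_and_run_crawler(glue, request)
#   wait_until_ready(glue, "my-crawler")
```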

Step 6: Data Query with Amazon Athena

Amazon Athena is a query service offered by AWS that lets us analyze data stored in Amazon S3 efficiently using standard SQL.
When working with a large amount of data, we need a querying tool to analyze it, and this is where Amazon Athena comes into play: it makes it easy to analyze the data present in an Amazon S3 bucket.
With Amazon Athena we do not need to learn a specialized query language. Athena supports standard SQL, so data analysts, data scientists, and organizations can perform analytics and derive valuable insights from the data with the SQL they already know.
Amazon Athena lets users query data stored in Amazon S3 in its original format. Navigate to the Amazon Athena console.

Amazon Athena

Click on "Query editor" and select the database we created in the earlier steps. Before executing any query, we need to provide a "Query result location", which is an Amazon S3 bucket.
Amazon Athena stores the output and metadata of each query it executes in the query result location.
Create an S3 bucket to store the query results, click on "Set up a query result location in Amazon S3", provide the bucket's path, and click the "Save" button.
With the query result location added, we can run queries in the Amazon Athena query editor.
Enter the following SQL query and click on the "Run" button.

SELECT * FROM "gfg-data-lake-db"."gfg-data-lake-bucket" LIMIT 10;

Fetching data through query

The following screenshot shows the output of the above query.

Output of the above query
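
The same query can be submitted outside the console. The sketch below assumes a boto3 Athena client is passed in; the result bucket in the usage comment is a placeholder for the query result location configured above. Athena runs queries asynchronously, so the code polls for completion before reading the rows.

```python
import time
from typing import Any, Dict, List, Optional


def query_request(sql: str, database: str, output_location: str) -> Dict[str, Any]:
    """Arguments for starting a query with an explicit query result location."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(athena: Any, request: Dict[str, Any], poll_seconds: float = 1.0) -> List[List[Optional[str]]]:
    """Start the query, wait for it to finish, and return the first page of rows.

    `athena` is assumed to be a boto3 Athena client. For a SELECT, the first
    returned row holds the column names.
    """
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]


# Usage (placeholder result bucket; needs boto3 and AWS credentials):
#   import boto3
#   request = query_request('SELECT * FROM "gfg-data-lake-bucket" LIMIT 10;',
#                           "gfg-data-lake-db", "s3://my-athena-results-bucket/")
#   for row in run_query(boto3.client("athena"), request):
#       print(row)
```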

Step 7: Clean Up

After following these steps, we have successfully created our AWS Data Lake from a combination of AWS services. Now it is time to clean up all the created resources to avoid unnecessary charges.

Delete all the created AWS resources, including:

Amazon S3 Buckets
IAM Users and Roles
AWS Glue Crawler
Database created in AWS Lake Formation
Registered data lake locations
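
The order of deletion matters for some of these resources: managed policies must be detached before an IAM user or role can be deleted, and a bucket must be empty before it can be removed. The sketch below builds an ordered clean-up plan and applies it to a set of boto3 clients. The plan structure and function names are assumptions made for illustration, and emptying the buckets is left out.

```python
from typing import Any, Dict, List, Tuple

Step = Tuple[str, str, Dict[str, Any]]  # (service, client method, arguments)


def cleanup_plan(bucket: str, crawler: str, database: str, user: str, role: str,
                 user_policy_arns: List[str], role_policy_arns: List[str]) -> List[Step]:
    """Ordered list of delete calls for the resources created in this guide."""
    plan: List[Step] = [
        ("glue", "delete_crawler", {"Name": crawler}),
        ("glue", "delete_database", {"Name": database}),
        ("lakeformation", "deregister_resource", {"ResourceArn": f"arn:aws:s3:::{bucket}"}),
    ]
    # Managed policies have to be detached before the user or role is deleted.
    plan += [("iam", "detach_user_policy", {"UserName": user, "PolicyArn": arn})
             for arn in user_policy_arns]
    plan.append(("iam", "delete_user", {"UserName": user}))
    plan += [("iam", "detach_role_policy", {"RoleName": role, "PolicyArn": arn})
             for arn in role_policy_arns]
    plan.append(("iam", "delete_role", {"RoleName": role}))
    # The bucket must already be empty, otherwise this call fails.
    plan.append(("s3", "delete_bucket", {"Bucket": bucket}))
    return plan


def run_plan(clients: Dict[str, Any], plan: List[Step]) -> None:
    """`clients` maps service names to boto3 clients, e.g. {"s3": boto3.client("s3")}."""
    for service, method, kwargs in plan:
        getattr(clients[service], method)(**kwargs)
```

The bucket holding the Athena query results needs the same treatment as the data bucket.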
