Aquileo | Hadoop - Architecture

Hadoop is an open-source Java framework that stores and processes massive data using clusters of inexpensive (commodity) hardware. Based on Google’s MapReduce programming model, it enables distributed, parallel processing. Big companies like Facebook, Yahoo, Netflix and eBay use Hadoop to handle large-scale data efficiently.

Components of Hadoop Architecture

Hadoop architecture mainly consists of four components:

MapReduce
HDFS (Hadoop Distributed File System)
YARN (Yet Another Resource Negotiator)
Common Utilities or Hadoop Common

Let's look at the role of each of these components in detail.

1. MapReduce

MapReduce is Hadoop's data processing model and runs on YARN. It enables distributed, parallel processing by dividing a job into two phases, Map and Reduce, which makes it efficient for handling large-scale data.

MapReduce Workflow: The workflow begins when input data is divided into splits and read as key-value pairs, which the Map() function turns into intermediate key-value pairs. These are then sorted and grouped by key and passed to the Reduce() function for tasks such as aggregation. The final output is written to HDFS.

Note: Core aggregation often happens in the Reducer, but preprocessing and filtering are ideally done in the Mapper to optimize performance.

Map Task Components:

RecordReader: reads input data and converts it into key-value pairs, with keys as location info and values as actual data.
Mapper: processes each pair and outputs zero or more new key-value pairs.
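Combiner (optional): acts as a mini-reducer that locally aggregates Mapper output, reducing the data transferred during shuffling.
Partitioner: assigns each key-value pair to a Reducer. The default hash partitioner computes (key.hashCode() & Integer.MAX_VALUE) % numberOfReducers; the mask keeps the result non-negative.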
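
The hash partitioning rule can be sketched outside Hadoop. The snippet below is a plain-Python illustration, not Hadoop code: java_string_hash and partition are hypothetical helper names, and the hash emulation assumes keys are ASCII strings. It clears the sign bit before taking the modulo, because in Java a negative hashCode() would otherwise yield a negative partition index.

```python
def java_string_hash(s: str) -> int:
    """Emulate Java's String.hashCode() as a signed 32-bit integer.

    Assumption: keys contain only ASCII characters, so Python code points
    match Java's UTF-16 code units.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # wrap like 32-bit int overflow
    # Reinterpret the unsigned 32-bit value as signed
    return h - 0x100000000 if h >= 0x80000000 else h


def partition(key: str, num_reducers: int) -> int:
    """Pick the reducer index for a key (hypothetical helper)."""
    # Clear the sign bit so the index is never negative
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reducers


if __name__ == "__main__":
    for key in ["hadoop", "hdfs", "yarn"]:
        print(key, "->", partition(key, 4))
```

Because the index depends only on the key, every pair with the same key lands on the same Reducer, which is what makes grouping by key possible.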

Reduce Task Components:

Shuffle and Sort: transfers intermediate key-value pairs from Mappers to Reducers and sorts them by key. Shuffling begins as soon as some Mappers finish.
Reducer: processes grouped key-value pairs, performing tasks like aggregation or filtering based on logic.
OutputFormat: writes final results to HDFS using a RecordWriter, typically storing each record as a key-value line.
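
To make the Map and Reduce task components concrete, here is a minimal single-process sketch of a word count. It only mimics the stages described above and does not use the Hadoop API; every function name is hypothetical, the line index stands in for the RecordReader's byte offset, and the partitioner is a toy byte-sum rather than Hadoop's hash partitioner.

```python
from collections import defaultdict


def mapper(offset, line):
    """Emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Mini-reducer: sum counts locally before anything is shuffled."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Route pairs to reducers, then group and sort them by key."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            index = sum(key.encode()) % num_reducers  # toy partitioner
            partitions[index][key].append(value)
    return [sorted(p.items()) for p in partitions]


def reducer(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)


def run_job(splits, num_reducers=2):
    mapper_outputs = []
    for split in splits:  # one map task per input split
        pairs = []
        for offset, line in enumerate(split):  # RecordReader stand-in
            pairs.extend(mapper(offset, line))
        mapper_outputs.append(combiner(pairs))
    result = {}
    for partition in shuffle_and_sort(mapper_outputs, num_reducers):
        for key, values in partition:
            word, total = reducer(key, values)
            result[word] = total  # OutputFormat stand-in
    return result


if __name__ == "__main__":
    print(run_job([["big data big"], ["data big cluster"]]))
```

In this sketch the combiner shrinks the first split's output from three pairs to two before the shuffle, which is the data-transfer saving the Combiner is meant to provide.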

2. HDFS

HDFS (Hadoop Distributed File System) is Hadoop’s primary storage system, built for high-throughput access to large datasets. It runs on inexpensive commodity hardware and stores data in large blocks to optimize performance. HDFS ensures fault tolerance and high availability across the cluster.

HDFS Architecture Components:

NameNode (Master Node): Stores metadata (not actual data), manages file operations and directs clients to the nearest DataNode for efficient access.
DataNode (Slave Node): Stores actual data blocks, serves read/write requests and reports to the NameNode. Blocks are replicated (default factor 3) for fault tolerance, and adding DataNodes scales both storage and performance.

File Block In HDFS: In HDFS, data is always stored in the form of blocks. By default, each block is 128MB in size, although this value can be configured depending on the use case (commonly increased to 256MB or more in modern systems).

Suppose you upload a file of 400MB to HDFS. Hadoop will divide this file into blocks as follows:

128MB + 128MB + 128MB + 16MB = 400MB

This creates four blocks: three of 128MB and one of 16MB. Hadoop splits files purely by size, not content, so a single record can span two blocks.
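
The split arithmetic can be checked with a few lines of Python. This is only a sketch of the size-based rule described above; split_into_blocks is a hypothetical helper, not part of any Hadoop API, and sizes are in whole megabytes for simplicity.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only the leftover data, not a full block
        blocks.append(remainder)
    return blocks


if __name__ == "__main__":
    print(split_into_blocks(400))  # [128, 128, 128, 16]
```

With a 256MB block size, the same 400MB file would be cut into two blocks instead (256MB and 144MB).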

1. Comparison with Traditional File Systems

Traditional file systems use small blocks (e.g., 4KB), while HDFS uses large blocks (128MB or more).
Larger blocks in HDFS reduce metadata and I/O overhead, improving scalability and efficiency for big data processing.

2. Replication In HDFS:

HDFS replication ensures data availability and fault tolerance by storing multiple copies of each block.

Default Replication Factor: 3 (configurable through the dfs.replication property in hdfs-site.xml)
If a file is split into 4 blocks, with a replication factor of 3: 4 blocks × 3 replicas = 12 block replicas in total

HDFS is designed for commodity hardware, where failures are common, so replication is what prevents data loss. While it increases storage usage, reliability is prioritized over space efficiency.
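
The storage cost of replication is easy to estimate. The sketch below uses a hypothetical helper, replication_footprint, and assumes sizes in whole megabytes and that a partially filled last block only occupies the space of the data it holds.

```python
import math


def replication_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate block count, replica count and raw storage for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # Raw storage scales with the data size, not with the block count
        "raw_storage_mb": file_size_mb * replication,
    }


if __name__ == "__main__":
    print(replication_footprint(400))
```

For a 400MB file this gives 4 blocks, 12 block replicas and 1200MB of raw storage, which is the space-for-reliability trade-off described above.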

3. Rack Awareness:

A rack is a group of machines (typically 30–40) in a Hadoop cluster that share a network switch. Large clusters have many racks. Rack Awareness helps the NameNode to:

Choose the nearest DataNode for faster read/write operations.
Reduce network traffic by minimizing inter-rack data transfer.

This improves overall performance and efficiency in data access.
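
As an illustration of "choose the nearest DataNode", the sketch below ranks replicas by a simple network distance. The distance values (0 for the same node, 2 for the same rack, 4 for a different rack) and the (rack, host) tuples are assumptions made for this example, and distance and nearest_replica are hypothetical helper names rather than HDFS functions.

```python
def distance(a, b):
    """Network distance between two nodes given as (rack, host) tuples."""
    if a == b:
        return 0  # same node
    if a[0] == b[0]:
        return 2  # same rack, different host
    return 4  # different racks


def nearest_replica(client, replicas):
    """Return the replica location closest to the client."""
    return min(replicas, key=lambda node: distance(client, node))


if __name__ == "__main__":
    client = ("rack1", "host1")
    replicas = [("rack2", "host5"), ("rack1", "host2"), ("rack2", "host6")]
    print(nearest_replica(client, replicas))  # ('rack1', 'host2')
```

Reading from the same-rack replica keeps the transfer behind one switch, which is how rack awareness cuts inter-rack traffic.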

3. YARN (Yet Another Resource Negotiator)

YARN is the resource management layer of the Hadoop ecosystem. It allows multiple data processing engines, such as MapReduce and Spark, to run on the same cluster and share its resources efficiently.

It handles two core responsibilities:

Job Scheduling: Splits large jobs into smaller tasks, assigns them to nodes and manages priorities, dependencies and execution.
Resource Management: Allocates and monitors cluster resources (CPU, memory, etc.) needed for job execution.

Components of YARN:

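ResourceManager: Master daemon that manages global resource allocation across the cluster.
NodeManager: Worker daemon on each node that launches containers and monitors resource usage on that node.
ApplicationMaster: Manages the lifecycle of an individual application/job and requests the resources it needs from the ResourceManager.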
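
How these three components interact can be sketched with a toy model. This is not the YARN API: the classes below only borrow the component names, and the first-fit placement is an assumption made for brevity, whereas real YARN schedulers apply much richer policies.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Tracks the free resources of one worker node."""
    name: str
    free_memory_mb: int
    free_vcores: int


class ResourceManager:
    """Grants containers from the cluster-wide pool (first-fit, assumed)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return node.name  # container granted on this node
        return None  # no node has enough free capacity right now


class ApplicationMaster:
    """Requests one container per task on behalf of a single application."""

    def __init__(self, resource_manager, tasks):
        self.resource_manager = resource_manager
        self.tasks = tasks

    def run(self, memory_mb, vcores):
        return {
            task: self.resource_manager.allocate(memory_mb, vcores)
            for task in self.tasks
        }


if __name__ == "__main__":
    rm = ResourceManager([
        NodeManager("nm1", free_memory_mb=4096, free_vcores=2),
        NodeManager("nm2", free_memory_mb=2048, free_vcores=2),
    ])
    am = ApplicationMaster(rm, ["task0", "task1", "task2", "task3"])
    print(am.run(memory_mb=2048, vcores=1))
```

In this run the fourth task gets no container because both nodes are out of memory; in a real cluster the request would wait until resources are released.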

Key Features of YARN:

Multi-Tenancy: Supports multiple users and applications.
Scalability: Efficiently scales to handle thousands of nodes and jobs.
Better Cluster Utilization: Maximizes resource usage across the cluster.
Compatibility: Works with MapReduce and other processing models like Spark.

4. Hadoop Common (Common Utilities)

Hadoop Common, also known as Common Utilities, includes the core Java libraries and scripts required by all components of the Hadoop ecosystem, such as HDFS, YARN and MapReduce.

These libraries offer core functionalities such as:

File system and I/O operations
Configuration and logging
Security and authentication
Network communication

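Because every component depends on these shared libraries, Hadoop Common is what lets the components work together. Like the rest of Hadoop, it is built on the assumption that hardware failures are common and should be handled automatically in software. It also includes tools like Hadoop Archive, native library support and RPC mechanisms.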
