Explain Hadoop File System along with its architecture.

Hadoop File System (HDFS)

The Hadoop File System was created with a distributed file system design. It runs on standard hardware. In contrast to other distributed systems, HDFS is extremely fault-tolerant and built with low-cost hardware. HDFS stores a big quantity of data and makes it easy to access. To store such large amounts of data, the files are spread over numerous machines. These files are kept in a redundant form to protect the system from data loss in the event of a breakdown. HDFS also enables parallel processing of applications.

Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data storage system. It implements a distributed file system that allows high-performance access to data across highly scalable Hadoop clusters using a NameNode and DataNode architecture. HDFS is an important component of many Hadoop ecosystem technologies, since it provides a dependable way of maintaining massive data pools and supporting the associated big data analytics applications.
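To make the client's view of HDFS concrete, the following is a minimal sketch that writes and reads a small file through Hadoop's Java FileSystem API. The namenode address and the file path are placeholders and would differ on a real cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");               // data is streamed to datanodes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());           // read the data back
        }
        fs.close();
    }
}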


Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the namenode and datanode help users easily check the status of the cluster.
  • It provides streaming access to file system data.
  • HDFS provides file permissions and authentication (see the permissions sketch below).
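As a small illustration of the last feature, the sketch below sets POSIX-style permissions on a file through the Java API; the cluster address and path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");           // hypothetical path

        // POSIX-style permissions: owner rw-, group r--, others r--.
        fs.setPermission(file, new FsPermission((short) 0644));
        System.out.println(fs.getFileStatus(file).getPermission());
        fs.close();
    }
}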


 HDFS Architecture

The architecture of the Hadoop File System is shown in Figure 7.2. The components of this architecture are described below:



Namenode

The namenode is the commodity hardware that runs the GNU/Linux operating system and the namenode software; the namenode software itself can run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks:

• Manages the file system namespace.

• Regulates clients' access to files.

• Executes file system operations such as renaming, closing, and opening files and directories (see the sketch below).
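A minimal sketch of such namespace operations through the Java FileSystem API follows; all of them are metadata operations served by the namenode, and the paths and cluster address are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        // Directory creation and renaming are pure namespace changes:
        // the namenode updates its metadata, no datanode is contacted.
        fs.mkdirs(new Path("/user/demo/reports"));
        fs.rename(new Path("/user/demo/hello.txt"),
                  new Path("/user/demo/reports/hello.txt"));

        // Listing a directory is answered from the namenode's namespace image.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}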


Datanode

The datanode is a piece of commodity hardware that runs the GNU/Linux operating system as well as the datanode software. There is a datanode for each node (commodity hardware/system) in the cluster, and these nodes are in charge of the data stored on their system.

Datanodes perform read-write operations on the file system, as per the clients' requests.

They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
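For illustration, the sketch below asks the cluster to change a file's replication factor and reads it back; the actual copying or deletion of block replicas is carried out by the datanodes on instruction from the namenode. The path and address are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/reports/hello.txt");  // hypothetical path

        // Request three replicas; the namenode schedules the extra copies
        // and the datanodes create (or delete) block replicas accordingly.
        fs.setReplication(file, (short) 3);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        fs.close();
    }
}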


Block

In general, user data is saved in HDFS files. In the file system, a file is partitioned into one or more segments and/or stored on separate datanodes. These file segments are called blocks. In other terms, a block is the smallest quantity of data that HDFS can read or write. The default block size is 64 MB in older Hadoop releases (128 MB in Hadoop 2.x and later), and it may be adjusted in the HDFS configuration as needed.
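As an assumed example, the sketch below reads the cluster's default block size and creates a file with a custom block size through the Java API; the 128 MB value, buffer size, replication factor, and path are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/big.dat");             // hypothetical path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));

        // Create the file with a 128 MB block size, a 4 KB write buffer
        // and a replication factor of 2 (illustrative values only).
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 2, blockSize)) {
            out.writeUTF("block size demo");
        }

        System.out.println("File block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}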


Goals of HDFS

Fault detection and recovery: Because HDFS relies on a large number of commodity hardware components, component failure is common. As a result, HDFS should have techniques for rapid and automated failure detection and recovery.

Huge datasets: To manage applications with large datasets, HDFS should have hundreds of nodes per cluster.

Hardware at data: When the computation takes place near the data, a requested job can be completed quickly. It decreases network traffic while increasing throughput, especially when large datasets are involved.
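To see how computation can be moved near the data, a client or scheduler can ask the namenode where each block of a file is physically stored and then prefer running tasks on those hosts. The sketch below shows this lookup; the path and address are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/big.dat");             // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Ask the namenode which datanodes hold each block of the file.
        for (BlockLocation block :
                 fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                               + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}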
