Explain Hadoop File System along with its architecture.

Hadoop File System (HDFS)

The Hadoop File System was created with a distributed file system design. It runs on standard hardware. In contrast to other distributed systems, HDFS is extremely fault-tolerant and built with low-cost hardware. HDFS stores a big quantity of data and makes it easy to access. To store such large amounts of data, the files are spread over numerous machines. These files are kept in a redundant form to protect the system from data loss in the event of a breakdown. HDFS also enables parallel processing of applications.

Hadoop applications use the Hadoop Distributed File System (HDFS) as their primary data storage system. It implements a distributed file system that allows high-performance access to data across highly scalable Hadoop clusters using a NameNode and DataNode architecture. HDFS is an important component of many Hadoop ecosystem technologies, since it provides a dependable way of maintaining massive data pools and supporting the associated big data analytics applications.
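To make the client's view of HDFS concrete, the following is a minimal sketch that writes and reads a small file through Hadoop's Java FileSystem API. The namenode address and the file path are placeholders and would differ on a real cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; on a real cluster this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("Hello, HDFS!");               // data is streamed to datanodes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());           // read the data back
        }
        fs.close();
    }
}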


Features of HDFS

  • It is suitable for distributed storage and processing.
  • Hadoop provides a command interface to interact with HDFS.
  • The built-in servers of the namenode and datanode help users easily check the status of the cluster.
  • It provides streaming access to file system data.
  • HDFS provides file permissions and authentication (see the permissions sketch below).
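As a small illustration of the last feature, the sketch below sets POSIX-style permissions on a file through the Java API; the cluster address and path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class PermissionsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/hello.txt");           // hypothetical path

        // POSIX-style permissions: owner rw-, group r--, others r--.
        fs.setPermission(file, new FsPermission((short) 0644));
        System.out.println(fs.getFileStatus(file).getPermission());
        fs.close();
    }
}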


 HDFS Architecture

The architecture of the Hadoop File System is shown in Figure 7.2. The components of this architecture are described below:



Namenode

The namenode is the commodity hardware that runs the GNU/Linux operating system and the namenode software; the namenode software itself can run on ordinary commodity hardware. The system hosting the namenode acts as the master server and performs the following tasks:

• Manages the file system namespace.

• Regulates clients' access to files.

• Executes file system operations such as renaming, closing, and opening files and directories (see the sketch below).
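A minimal sketch of such namespace operations through the Java FileSystem API follows; all of them are metadata operations served by the namenode, and the paths and cluster address are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        // Directory creation and renaming are pure namespace changes:
        // the namenode updates its metadata, no datanode is contacted.
        fs.mkdirs(new Path("/user/demo/reports"));
        fs.rename(new Path("/user/demo/hello.txt"),
                  new Path("/user/demo/reports/hello.txt"));

        // Listing a directory is answered from the namenode's namespace image.
        for (FileStatus status : fs.listStatus(new Path("/user/demo/reports"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}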


Datanode

The datanode is a piece of commodity hardware that runs the GNU/Linux operating system as well as the datanode software. There is a datanode for each node (commodity hardware/system) in the cluster, and these nodes are in charge of the data stored on their system.

Datanodes perform read-write operations on the file system, as per the clients' requests.

They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
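For illustration, the sketch below asks the cluster to change a file's replication factor and reads it back; the actual copying or deletion of block replicas is carried out by the datanodes on instruction from the namenode. The path and address are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/reports/hello.txt");  // hypothetical path

        // Request three replicas; the namenode schedules the extra copies
        // and the datanodes create (or delete) block replicas accordingly.
        fs.setReplication(file, (short) 3);

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        fs.close();
    }
}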


Block

In general, user data is saved in HDFS files. In the file system, a file is partitioned into one or more segments and/or stored on separate datanodes. These file segments are called blocks. In other terms, a block is the smallest quantity of data that HDFS can read or write. The default block size is 64 MB in older Hadoop releases (128 MB in Hadoop 2.x and later), and it may be adjusted in the HDFS configuration as needed.
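As an assumed example, the sketch below reads the cluster's default block size and creates a file with a custom block size through the Java API; the 128 MB value, buffer size, replication factor, and path are only illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/big.dat");             // hypothetical path
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));

        // Create the file with a 128 MB block size, a 4 KB write buffer
        // and a replication factor of 2 (illustrative values only).
        long blockSize = 128L * 1024 * 1024;
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 2, blockSize)) {
            out.writeUTF("block size demo");
        }

        System.out.println("File block size: " + fs.getFileStatus(file).getBlockSize());
        fs.close();
    }
}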


Goals of HDFS

Fault detection and recovery: Because HDFS relies on a large number of commodity hardware components, component failure is common. As a result, HDFS should have techniques for rapid and automated failure detection and recovery.

Huge datasets: To manage applications with large datasets, HDFS should have hundreds of nodes per cluster.

Hardware at data: When the computation takes place near the data, a requested job can be completed quickly. It decreases network traffic while increasing throughput, especially when large datasets are involved.
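To see how computation can be moved near the data, a client or scheduler can ask the namenode where each block of a file is physically stored and then prefer running tasks on those hosts. The sketch below shows this lookup; the path and address are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/big.dat");             // hypothetical path
        FileStatus status = fs.getFileStatus(file);

        // Ask the namenode which datanodes hold each block of the file.
        for (BlockLocation block :
                 fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                               + " stored on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}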
