Describe the architecture of Hadoop in your own words.

 Hadoop is an open-source framework that allows to storage and process of big data in a distributed environment across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is being used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes in the cluster.



Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine, and the HDFS (Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node includes DataNode and TaskTracker.



Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It contains master/slave architecture. This architecture consists of a single NameNode performing the role of master, and multiple DataNodes performing the role of a slave.

NameNode

  • It is a single master server that exists in the HDFS cluster.
  • As it is a single node, it may become the reason for single-point failure.
  • It manages the file system namespace by executing an operation like opening, renaming, and closing the files.
  • It simplifies the architecture of the system.

DataNode

  • The HDFS cluster contains multiple DataNodes.
  • Each DataNode contains multiple data blocks.
  • These data blocks are used to store data.
  • It is the responsibility of DataNode to read and write requests from the file system's clients.
  • It performs block creation, deletion, and replication upon instruction from the NameNode.

Job Tracker

  • The role of Job Tracker is to accept the MapReduce jobs from clients and process the data by using NameNode.
  • In response, NameNode provides metadata to Job Tracker.

Task Tracker

  • It works as a slave node for Job Tracker.
  • It receives tasks and code from Job Tracker and applies that code on the file. This process can also be called a Mapper.

MapReduce Layer

The MapReduce comes into existence when the client application submits the MapReduce job to Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes, the TaskTracker fails or times out. In such a case, that part of the job is rescheduled.


Comments

Popular posts from this blog

Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. a) Draw a snowflake schema diagram for the data warehouse. b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each BigUniversity student. c) If each dimension has five levels (including all), such as “student < major < status < university < all”, how many cuboids will this cube contain (including the base and apex cuboids)?

Suppose that a data warehouse consists of the four dimensions; date, spectator, location, and game, and the two measures, count and charge, where charge is the fee that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate. a) Draw a star schema diagram for the data b) Starting with the base cuboid [date; spectator; location; game], what specific OLAP operations should perform in order to list the total charge paid by student spectators at GM Place in 2004?

Explain network topology .Explain tis types with its advantages and disadvantges.