Explain the features of Hadoop.

Features of Hadoop

1. Open Source:

Hadoop is open-source, which means it is free to use. Since it is an open-source project, the source code is available online for anyone to study or to modify as per their industry's requirements.


2. Highly Scalable Cluster:

Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational Database Management System), the system cannot be scaled out this way to handle large amounts of data.


3. Fault Tolerance is Available:

Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data even if one of your systems crashes. If the machine you are reading from faces a technical issue, the same data can be read from other nodes in the cluster, because the data is copied or replicated by default. By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed via the replication property in the hdfs-site.xml file.
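For illustration, a minimal hdfs-site.xml entry that changes the default replication factor might look like the sketch below; dfs.replication is the standard property name, and the value of 2 is only an example.

```xml
<!-- hdfs-site.xml: override the default block replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- HDFS default is 3 copies; 2 here is just an example value -->
    <value>2</value>
  </property>
</configuration>
```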



4. High Availability is Provided:

Fault tolerance provides high availability in the Hadoop cluster. High availability means that the data stays available on the Hadoop cluster. Due to fault tolerance, in case any DataNode goes down, the same data can be retrieved from any other node where the data is replicated. A highly available Hadoop cluster also has 2 or more NameNodes, i.e. an Active NameNode and a Passive NameNode (also known as a standby NameNode). In case the Active NameNode fails, the Passive NameNode takes over the responsibility of the Active NameNode and serves the same data to users.
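As a sketch, NameNode high availability is typically configured in hdfs-site.xml along the following lines; the nameservice name mycluster, the NameNode IDs nn1/nn2, and the hostnames are all placeholder values.

```xml
<!-- Sketch: two NameNodes behind one logical nameservice -->
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>namenode1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>namenode2.example.com:8020</value>
  </property>
  <!-- let ZooKeeper-based failover promote the standby automatically -->
  <property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>
</configuration>
```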


5. Cost-Effective:

Hadoop is open-source and uses cost-effective commodity hardware, which gives us a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing a massive volume of data is not cost-effective, so companies started to discard the raw data, which may not lead to the right outcomes for their business. Hadoop thus provides two main cost benefits: it is open-source and therefore free to use, and it runs on commodity hardware, which is also inexpensive.


6. Hadoop Provide Flexibility:

Hadoop is designed in such a way that it can deal with any kind of dataset, whether structured (MySQL data), semi-structured (XML, JSON), or unstructured (images and videos), very efficiently. This means it can easily process any kind of data independently of its structure, which makes it highly flexible. This is very useful for enterprises, as they can easily process large datasets and extract valuable insights from sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.


7. Easy to Use:

Hadoop is easy to use, since developers need not worry about any of the distributed processing work; it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.


8. Hadoop uses Data Locality:

The concept of data locality is used to make Hadoop processing fast. With data locality, the computation logic is moved near the data rather than moving the data to the computation logic. Moving data around HDFS is the costliest operation, so the data locality concept minimizes the bandwidth utilization in the system.
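As an illustration, the sketch below asks the NameNode which hosts store each block of a file; the MapReduce scheduler consults exactly this kind of information to place each task on (or near) a node that already holds its block. The path /data/input.txt is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Connect to HDFS using the cluster configuration on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // Placeholder file; any existing HDFS path would do.
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));

        // Ask the NameNode which DataNodes hold each block of the file.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on: " + String.join(", ", block.getHosts()));
        }
    }
}
```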


9. Provides Faster Data Processing:

Hadoop uses a distributed file system, HDFS (Hadoop Distributed File System), to manage its storage. In a DFS (Distributed File System), a large file is broken into small file blocks which are then distributed among the nodes available in a Hadoop cluster. Because this massive number of file blocks is processed in parallel, Hadoop is faster and provides high-level performance compared to traditional database management systems.

OR,

The top 8 features of Hadoop are:

  • Cost Effective System
  • Large Cluster of Nodes
  • Parallel Processing
  • Distributed Data
  • Automatic Failover Management
  • Data Locality Optimization
  • Heterogeneous Cluster
  • Scalability


1) Cost Effective System

The Hadoop framework is a cost-effective system, that is, it does not require any expensive or specialized hardware in order to be implemented. It can be implemented on ordinary, inexpensive hardware components, which are technically referred to as commodity hardware.


2) Large Cluster of Nodes

It supports a large cluster of nodes: a Hadoop cluster can be made up of thousands of nodes. The main advantage of this feature is that it offers huge computing power and a huge storage system to the clients.


3) Parallel Processing

It supports the parallel processing of data, meaning the data can be processed simultaneously across all the nodes in the cluster. This saves a lot of time.
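To make this concrete, here is the classic MapReduce word count (essentially the introductory example from the Hadoop documentation): every mapper counts words in its own input block at the same time as all the other mappers, and the reducers then merge the partial counts. The input and output paths are supplied on the command line when the job jar is submitted with hadoop jar.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Each mapper processes one block of the input in parallel
  // with all the other mappers in the cluster.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);   // emit (word, 1) for every token
      }
    }
  }

  // Reducers merge the partial counts produced by the mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);   // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```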


4) Distributed Data

The Hadoop framework takes care of splitting the data and distributing it across all the nodes within a cluster. It also replicates the data over the entire cluster.


5) Automatic Failover Management

In case a particular machine within the cluster fails, the Hadoop network replaces that machine with another one and replicates the configuration settings and data from the failed machine onto the new machine. Once this feature has been properly configured on a cluster, the admin need not worry about it.


6) Data Locality Optimization

In a traditional approach, whenever a program is executed, the data is transferred from the data center to the machine where the program runs. For instance, assume the data required by a program is located in a data center in the USA while the program itself is in Singapore, and the data is about 1 PB in size. Transferring data of this size from the USA to Singapore would consume a lot of bandwidth and time. Hadoop eliminates this problem by transferring the code, which is only a few megabytes in size: it sends the code from Singapore to the data center in the USA and then executes it locally on that data (as in the block-location sketch under feature 8 above). This process saves a lot of time and bandwidth, and it is one of the most important features of Hadoop.


7) Heterogeneous Cluster

It supports heterogeneous clusters, which is also one of the most important features offered by the Hadoop framework. A heterogeneous cluster is one where each node can come from a different vendor and run a different version or flavor of the operating system. For example, consider a cluster made up of four nodes: the first is an IBM machine running RHEL (Red Hat Enterprise Linux), the second an Intel machine running Ubuntu Linux, the third an AMD machine running Fedora Linux, and the last an HP machine running CentOS Linux.

8) Scalability

It refers to the ability to add or remove nodes, as well as to add or remove hardware components to or from the cluster, without affecting or bringing down cluster operation. Individual hardware components like RAM or hard drives can also be added to or removed from a cluster.
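As a sketch of how nodes are removed without downtime, HDFS can be pointed at include/exclude host-list files in hdfs-site.xml; the file paths below are placeholders. After editing the exclude file, running hdfs dfsadmin -refreshNodes asks the NameNode to re-read the lists and gracefully decommission the listed DataNodes.

```xml
<!-- Sketch: host lists that let DataNodes be added or decommissioned
     at runtime; both paths are placeholder values. -->
<configuration>
  <property>
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/dfs.hosts</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <value>/etc/hadoop/conf/dfs.hosts.exclude</value>
  </property>
</configuration>
```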
