Explain Apache Hadoop in detail.

 Apache Hadoop

  • Apache Hadoop is a Java-based, open-source software framework for scalable, distributed computing that manages data processing and storage in big data applications. Hadoop distributes large data sets and analytical jobs among the nodes of a computing cluster, breaking them down into smaller tasks that can run concurrently. Hadoop can process both structured and unstructured data and can scale reliably from a single server to thousands of machines.
  • The Apache Hadoop software library provides a platform for distributed processing of massive data volumes across computer clusters using simple programming models. It is designed to scale from a single server to thousands of machines, each of which contributes local computation and storage. Rather than relying on hardware to provide high availability, the library is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of machines, each of which may fail. (A minimal sketch of reading and writing data in HDFS through the Java API follows this list.)
  • Hadoop gave businesses the ability to examine and query large data sets in a scalable manner using free, open-source software and low-cost off-the-shelf hardware. This was an important breakthrough because it provided a viable alternative to the proprietary data warehouse (DW) systems and closed data formats that had previously dominated the market. With the launch of Hadoop, enterprises gained the capacity to store and analyze massive volumes of data, greater computing power, fault tolerance, flexibility in data management, lower costs compared to DWs, and easier scalability: simply add more nodes. Finally, Hadoop set the stage for later big data analytics breakthroughs, such as the release of Apache Spark.
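The sketch below shows what "local computing and storage" looks like from application code: it uses the standard org.apache.hadoop.fs.FileSystem API to write a small file to HDFS and read it back. The path and file contents are illustrative, and the example assumes a Hadoop client dependency and cluster configuration (core-site.xml / hdfs-site.xml) on the classpath; without them, fs.defaultFS falls back to the local file system.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS determines whether this talks to a cluster or the local FS.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/hadoop-demo/hello.txt"); // illustrative path

        // Write a small file. HDFS splits larger files into blocks and
        // replicates each block across DataNodes for fault tolerance.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello from hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```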


Hadoop is a general term that may refer to any of the following:

  • The overall Hadoop ecosystem encompasses both the core modules and related sub-modules.
  • The core Hadoop modules include Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), MapReduce, and Hadoop Common. These are the basic building blocks of a typical Hadoop deployment (a word-count sketch that exercises these modules follows this list).
  • Hadoop-related sub-modules include Apache Hive™, Apache Impala™, Apache Pig™, and Apache Zookeeper™. These related pieces of software can be used to customize, improve upon, or extend the functionality of core Hadoop.
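The classic word count is a compact illustration of how the core modules fit together: the input files live in HDFS, YARN schedules the map and reduce tasks across the cluster, and MapReduce supplies the programming model. This is a minimal sketch assuming a standard Hadoop 3.x MapReduce client dependency; the class names and the input/output paths passed on the command line are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel on splits of the HDFS input, emitting (word, 1).
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives every count emitted for one word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, a job like this would typically be submitted with something along the lines of hadoop jar wordcount.jar WordCount /input /output, with YARN allocating containers for the map and reduce tasks.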

Examples of Popular Hadoop-related Software 

Popular Hadoop packages that are not strictly a part of the core Hadoop modules, but that are frequently used in conjunction with them, include: 

  • Apache Hive is data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL (a hedged example of querying Hive from Java over JDBC follows this list).
  • Apache Impala is the open-source, native analytic database for Apache Hadoop. 
  • Apache Pig is a tool that is generally used with Hadoop as an abstraction over MapReduce to analyze large sets of data represented as data flows. Pig enables operations like join, filter, sort, load, etc.
  • Apache Zookeeper is a centralized service for enabling highly reliable distributed processing.
  • Apache Sqoop™ is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
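To make the Hive bullet above concrete, here is a hedged sketch of running a HiveQL query from Java through the HiveServer2 JDBC driver. The host, port, credentials, and the page_views table are assumptions for illustration; the example presumes the hive-jdbc driver is on the classpath and a HiveServer2 instance is running.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database name are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // "page_views" is a hypothetical Hive table backed by files in HDFS.
             ResultSet rs = stmt.executeQuery(
                     "SELECT user_id, COUNT(*) AS views "
                     + "FROM page_views GROUP BY user_id")) {
            while (rs.next()) {
                System.out.println(rs.getString("user_id") + "\t" + rs.getLong("views"));
            }
        }
    }
}
```

Depending on the Hive version and configuration, such a query may be compiled down to a MapReduce or Tez job that runs on the cluster.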

