Explain the different core modules of Hadoop and give examples of Hadoop-related software.

 The core modules of Hadoop include:

  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
  • Hadoop Yet Another Resource Negotiator (YARN): A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (see the word-count sketch after this list).
  • Hadoop Ozone: An object store for Hadoop.
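
To make the MapReduce module concrete, below is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. When such a job is submitted, YARN schedules its map and reduce tasks across the cluster, and the two command-line arguments name input and output directories in HDFS; the class and path names here are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper: emits (word, 1) for every token in its input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // pre-aggregates map-side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Packaged into a jar, a job like this would typically be launched with something like hadoop jar wordcount.jar WordCount /input /output, with both paths living in HDFS.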



Examples of Popular Hadoop-related Software 

Popular Hadoop packages that are not strictly part of the core Hadoop modules, but are frequently used in conjunction with them, include:

  • Apache Hive is data warehouse software that runs on Hadoop and enables users to work with data in HDFS using a SQL-like query language called HiveQL (see the sketch after this list).
  • Apache Impala is the open-source, native analytic database for Apache Hadoop. 
  • Apache Pig is a tool that is generally used with Hadoop as an abstraction over MapReduce to analyze large sets of data represented as data flows. Pig supports operations such as join, filter, sort, and load.
  • Apache ZooKeeper is a centralized service for enabling highly reliable distributed coordination.
  • Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
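
Because Hive exposes HiveQL over JDBC through HiveServer2, a short Java client shows how a tool like this is used from application code. This is a minimal sketch, assuming a HiveServer2 instance on localhost:10000, a hypothetical table named web_logs, and the Apache Hive JDBC driver (hive-jdbc) on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // Register the Hive driver explicitly (optional on JDBC 4+).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 endpoint; "default" is the database name.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL, but Hive compiles it into jobs that
             // read the table's underlying files from HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }

The table name, credentials, and endpoint are all placeholders; the point is that any JDBC-capable application can query data in HDFS through Hive without writing MapReduce code directly.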
