Explain the benefits and challenges of Hadoop.

Benefits of Hadoop

Scalability: Unlike traditional systems, which are constrained by the storage of a single machine, Hadoop scales horizontally: data and computation are distributed across a cluster, and capacity grows simply by adding commodity nodes. This scalability is what enabled data architects to build the early Hadoop data lakes.
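
As a minimal sketch of how that elasticity looks from the client side (assuming a reachable HDFS cluster and the Hadoop client libraries on the classpath), the snippet below queries the filesystem's aggregate capacity, which is pooled across all DataNodes and grows as nodes join, with no change to client code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class ClusterCapacity {
    public static void main(String[] args) throws Exception {
        // Connects using the cluster settings found on the classpath
        // (core-site.xml / hdfs-site.xml).
        FileSystem fs = FileSystem.get(new Configuration());

        // Aggregate capacity across all DataNodes in the cluster;
        // adding nodes raises this number transparently.
        FsStatus status = fs.getStatus();
        System.out.printf("capacity: %d bytes, used: %d, remaining: %d%n",
                status.getCapacity(), status.getUsed(), status.getRemaining());
        fs.close();
    }
}
```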

Resilience: The Hadoop Distributed File System (HDFS) is intrinsically fault tolerant. To guard against hardware or software failures, data stored on any node of a Hadoop cluster is also replicated on other nodes. This redundancy is deliberate: if one node fails, a copy of its data is always available elsewhere in the cluster.
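
A minimal sketch of controlling that replication through the Hadoop Java API (the file path is hypothetical; a running HDFS cluster is assumed):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep three copies of each block of files written through
        // this client (three is the usual HDFS default).
        conf.set("dfs.replication", "3");
        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing (hypothetical)
        // file so its blocks survive the loss of more nodes.
        Path important = new Path("/data/important.log");
        fs.setReplication(important, (short) 5);
        fs.close();
    }
}
```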

Flexibility: Unlike typical relational database management systems, Hadoop lets you store data in any format, including semi-structured and unstructured data. This allows organizations to readily tap new data sources and work with many kinds of data.
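
To illustrate, here is a small sketch of this "store first, interpret later" model (the path and JSON payload are hypothetical): HDFS accepts raw bytes without requiring any schema up front.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemaOnRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // HDFS stores raw bytes: semi-structured JSON here, but CSV,
        // logs, images, or Avro/Parquet files work the same way.
        // Structure is imposed later, at read time.
        String json = "{\"user\": \"alice\", \"event\": \"login\"}\n";
        try (FSDataOutputStream out =
                fs.create(new Path("/raw/events/2024-01-01.json"))) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        fs.close();
    }
}
```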


Challenges with Hadoop Architectures

Complexity: Hadoop is a low-level, Java-based platform that can be too complicated for end users to work with directly. Hadoop infrastructures also require substantial expertise and resources to set up, maintain, and upgrade.
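
To see why Hadoop is often called low-level, consider the canonical MapReduce word-count job in Java, sketched below (input and output paths come from the command line): roughly fifty lines of boilerplate for a task that is a one-line query in SQL.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1) per token
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum)); // total per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```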

Performance: Hadoop performs computations by frequently reading from and writing to disk, which is slow and inefficient compared with frameworks such as Apache Spark that aim to keep data and intermediate results in memory as much as possible.
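
For contrast, here is a sketch of the same word count using Spark's Java API (assuming spark-core on the classpath; the local master is only for demo runs). The chained transformations stay in memory between stages, and cache() keeps the input resident for reuse, avoiding the per-stage disk round trips of a chain of MapReduce jobs.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("word count")
                .setMaster("local[*]"); // run locally for demo purposes
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // cache() keeps the dataset in memory, so repeated use
            // does not re-read it from disk.
            JavaRDD<String> lines = sc.textFile(args[0]).cache();
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.saveAsTextFile(args[1]);
        }
    }
}
```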

Long-term viability: The Hadoop ecosystem saw a significant unraveling in 2019. Google, whose groundbreaking 2004 paper on MapReduce served as the foundation for Apache Hadoop, had already discontinued its use of MapReduce entirely, and the Hadoop industry went through several high-profile mergers and acquisitions. Furthermore, in 2020 a prominent Hadoop supplier shifted its product set away from being Hadoop-centric, since Hadoop is now considered "more of a mindset than a technology."


