Explain the Google MapReduce infrastructure along with its variants.
GOOGLE MAPREDUCE INFRASTRUCTURE
- The user submits MapReduce jobs for execution through client libraries, which are responsible for sending the input data files, registering the map and reduce functions, and returning control to the user once the job is done. MapReduce applications can be executed on any distributed infrastructure that provides job scheduling and distributed storage. Two types of processes run on this infrastructure: master processes and worker processes.
- The master process directs the execution of map and reduce tasks, and partitions and rearranges the intermediate output of the map tasks to feed the reduce tasks. The worker processes host the execution of the map and reduce operations and offer basic I/O facilities for interacting with the input and output files. In a MapReduce computation, input files are divided into splits of usually 16 to 64 MB and stored in a distributed file system (the Google File System in Google's deployment; HDFS in Hadoop). The master process creates the map tasks and assigns an input split to each of them while balancing the load.
- Worker processes use input and output buffers to optimize the performance of the map and reduce operations. The output buffers of map operations are periodically dumped to disk to produce intermediate files. To divide the output of the map operations evenly, the intermediate files are partitioned using a function, typically a hash of the intermediate key modulo the number of reduce tasks (a sketch of such a function follows this list). The locations of these partitions are then sent to the master process, which forwards them to the reduce tasks; a reduce task gathers its input through remote procedure calls that read from the map tasks' local disks. The intermediate keys are then sorted, and all values that share the same key are grouped together. Finally, the reduce function is applied to generate the final result, which is saved in the global file system. This procedure is fully automated; users can control it by providing, in addition to the map and reduce functions, the number of map tasks, the number of partitions into which the final output is divided, and the partition function for the intermediate keys.
- The MapReduce runtime ensures application reliability through a fault-tolerant architecture. Failures of both master and worker processes are handled, as are machine failures that make intermediate outputs inaccessible. Worker failures are addressed by rescheduling the affected map tasks on other workers (a conceptual sketch of this follows the list); the same approach resolves machine failures in which the intermediate output of already-completed map tasks is no longer available. Master failures are handled instead by checkpointing, which allows the MapReduce job to be restarted with minimal loss of data and computation.
- As discussed earlier, MapReduce is a simplified paradigm for processing big data. Although the paradigm is used in a variety of scenarios, it constrains how distributed algorithms must be structured to fit the working mechanism of the MapReduce architecture. The abstractions MapReduce offers for processing data are deliberately simple, so complex problems may take significant effort to express in terms of map and reduce functions alone. As a result, several modifications of and variants on the original MapReduce architecture have been proposed, both to expand the MapReduce application area and to give developers a more user-friendly interface for building distributed algorithms. Hadoop, Pig, Hive, and Map-Reduce-Merge are some of these variants.
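As a hedged illustration of the partition function mentioned above, the usual choice is a hash of the intermediate key modulo the number of reduce tasks. The standalone Java sketch below mirrors the behaviour of Hadoop's default HashPartitioner; the class and method names are illustrative, not Google's actual code:

```java
// Minimal sketch of the default intermediate-key partition function:
// hash(key) mod R, where R is the number of reduce tasks.
// Mirrors the behaviour of Hadoop's default HashPartitioner.
public class HashPartitionFunction {

    /** Maps an intermediate key to one of numReduceTasks partitions. */
    public static int getPartition(String key, int numReduceTasks) {
        // Mask the sign bit so the result is always non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With R = 4 reduce tasks, every occurrence of the same key is
        // routed to the same partition (and hence the same reduce task).
        for (String key : new String[] {"apple", "banana", "apple"}) {
            System.out.println(key + " -> partition " + getPartition(key, 4));
        }
    }
}
```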
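The fault-tolerance behaviour described above can likewise be sketched. The following is a conceptual sketch only, assuming a heartbeat-based master; the names, timeout value, and bookkeeping structures are all illustrative, not Google's implementation. Note that completed map tasks are re-executed too, because their output resides on the failed worker's local disk:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch of how a MapReduce master might reassign map tasks
// when a worker stops sending heartbeats.
public class MasterFailureHandling {

    enum State { IDLE, IN_PROGRESS, COMPLETED }

    static class MapTask {
        final int splitId;      // input split assigned to this task
        State state = State.IDLE;
        String worker;          // worker currently hosting the task
        MapTask(int splitId) { this.splitId = splitId; }
    }

    static final long HEARTBEAT_TIMEOUT_MS = 10_000;  // assumed threshold
    final Map<String, Long> lastHeartbeat = new HashMap<>();
    final List<MapTask> tasks = new ArrayList<>();

    // Called periodically by the master's monitoring loop.
    void checkWorkers(long now) {
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > HEARTBEAT_TIMEOUT_MS) {
                rescheduleTasksOf(e.getKey());
            }
        }
    }

    void rescheduleTasksOf(String failedWorker) {
        for (MapTask t : tasks) {
            // Both in-progress AND completed map tasks are reset to IDLE:
            // reducers can no longer read the completed tasks' output.
            if (failedWorker.equals(t.worker) && t.state != State.IDLE) {
                t.state = State.IDLE;
                t.worker = null;   // eligible for assignment elsewhere
            }
        }
    }

    public static void main(String[] args) {
        MasterFailureHandling master = new MasterFailureHandling();
        MapTask t = new MapTask(0);
        t.state = State.COMPLETED;
        t.worker = "worker-7";
        master.tasks.add(t);
        master.lastHeartbeat.put("worker-7", 0L);  // last seen at t = 0
        master.checkWorkers(60_000);               // worker-7 has timed out
        System.out.println("task state after failure: " + t.state); // IDLE
    }
}
```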
1. Apache Hadoop is a family of software projects that enable scalable and reliable distributed computing. As a whole, Hadoop is an open-source implementation of the MapReduce architecture. It mainly consists of two projects: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. HDFS is an implementation of the Google File System, and Hadoop MapReduce offers the same functionality and abstractions as Google MapReduce (a classic word-count job written against the Hadoop API is sketched after this list). Originally developed and supported by Yahoo!, Hadoop is currently the most mature and comprehensive data cloud application, with a very active community of developers and users. Yahoo! operates the world's largest Hadoop cluster, consisting of 40,000 machines and more than 300,000 cores, which is also made available to academic institutions around the world.
2. Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating them. The outstanding property of Pig programs is that their structure is amenable to substantial parallelization, which in turn allows them to process very large data sets. Pig's infrastructure layer currently consists of a compiler that turns programs into sequences of MapReduce jobs, for which large-scale parallel implementations already exist. Pig's language layer currently consists of a textual language called Pig Latin, whose key features are ease of programming, opportunities for optimization, and extensibility (a small Pig Latin word count embedded in Java appears after this list).
3. Apache Hive is a data warehouse system that simplifies reading, writing, and managing huge data sets on distributed storage using SQL. It includes tools for quick data summarization, ad hoc queries, and the analysis of big datasets stored in Hadoop. The Hive architecture has features similar to a traditional data warehouse, but it does not perform well in terms of query latency, so it is not a viable option for online transaction processing. Built on top of Apache Hadoop, Hive provides tools to enable easy access to data via SQL, a mechanism to impose structure on a variety of data formats, access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase, and query execution via Apache Tez, Apache Spark, or MapReduce (a minimal Java/JDBC query against Hive is sketched after this list). Hive's main advantages are its capacity to scale out, since it is built on the Hadoop architecture, and its ability to provide a data warehouse infrastructure in situations where a Hadoop system is already running.
4. Map-Reduce-Merge is a modification of the MapReduce paradigm that introduces a third phase, 'merge', into the traditional MapReduce pipeline. The merge phase allows data that has previously been partitioned and sorted by the map and reduce modules to be combined efficiently. The Map-Reduce-Merge framework thus simplifies the handling of heterogeneous related datasets by providing an abstraction that can express standard relational algebra operations as well as several join algorithms (a conceptual merge-phase join is sketched below).
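To make the Hadoop MapReduce abstraction concrete, here is the classic word-count job written against the org.apache.hadoop.mapreduce API, following the standard Apache tutorial example; the input and output paths are supplied as command-line arguments:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);       // emit (word, 1)
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();               // sum all counts for this word
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```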
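As a small illustration of Pig, the same word count can be expressed in a few lines of Pig Latin. The sketch below embeds the script in Java through Pig's PigServer API, run in local mode; the file paths are illustrative placeholders:

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Sketch of embedding a Pig Latin word count in Java via PigServer.
// 'input.txt' and 'wordcount-out' are placeholder paths.
public class PigWordCount {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
    pig.registerQuery(
        "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
    pig.registerQuery("grouped = GROUP words BY word;");
    pig.registerQuery(
        "counts = FOREACH grouped GENERATE group, COUNT(words);");
    // Pig's compiler turns these statements into MapReduce jobs.
    pig.store("counts", "wordcount-out");
  }
}
```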
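A minimal sketch of how a client might work with Hive over JDBC is shown below; the HiveServer2 URL, the credentials, and the logs table are assumptions made purely for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of querying Hive through its JDBC driver (HiveServer2).
public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Impose structure on files already sitting in HDFS ...
      stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, msg STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
      // ... then run an ad hoc SQL summary; Hive compiles this into
      // MapReduce (or Tez/Spark) jobs behind the scenes.
      ResultSet rs = stmt.executeQuery(
          "SELECT ts, COUNT(*) FROM logs GROUP BY ts");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}
```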
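Finally, a conceptual sketch of what the extra merge phase buys: joining the sorted outputs of two separate MapReduce computations on a shared key, i.e. a relational equi-join. This illustrates the idea only, not the framework's actual API; all names and types below are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Conceptual sketch of a merge phase that consumes the sorted outputs of
// TWO MapReduce computations and emits joined records.
public class MergePhaseSketch {

  // Merge two reduce outputs, each sorted by key, in a single pass.
  static List<String> merge(SortedMap<String, String> left,
                            SortedMap<String, String> right) {
    List<String> joined = new ArrayList<>();
    for (Map.Entry<String, String> e : left.entrySet()) {
      String match = right.get(e.getKey());
      if (match != null) {
        // Emit a combined record for every key present in both inputs.
        joined.add(e.getKey() + "\t" + e.getValue() + "\t" + match);
      }
    }
    return joined;
  }

  public static void main(String[] args) {
    SortedMap<String, String> employees =
        new TreeMap<>(Map.of("e1", "Alice", "e2", "Bob"));
    SortedMap<String, String> salaries =
        new TreeMap<>(Map.of("e1", "90000", "e3", "70000"));
    // Only 'e1' appears in both datasets, so one joined row is emitted.
    merge(employees, salaries).forEach(System.out::println);
  }
}
```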