List some of the important storage technologies that support data-intensive computing and describe one of them.

  •  Data-intensive computing refers to the development of applications that are primarily focused on processing massive amounts of data.
  • The technologies that enable data-intensive computing fall naturally into two categories: storage systems and programming models. The important storage technologies that support data-intensive computing are described below.

1) Storage Systems

  • Because of the growth of unstructured data in the form of blogs, Web pages, software logs, and sensor readings, the relational model in its original formulation does not appear to be the preferable approach for enabling large-scale data analytics. Fields such as scientific computing, enterprise applications, media entertainment, natural language processing, and social network analysis generate very large quantities of data, which businesses increasingly treat as an asset as they realize the importance of data analytics. This growth in data volume calls for more efficient data management techniques.
  •  As discussed earlier, cloud computing provides on-demand access to enormous amounts of computing capability, enabling developers to create software systems that scale progressively to arbitrary levels of parallelism, with applications and services running on hundreds or thousands of nodes. Traditional computing architectures cannot accommodate such a dynamic environment, so new data management technologies have been introduced to support data-intensive computing.
  • Distributed file systems play a vital role in big data management, providing the interface to store information in the form of files and to perform read and write operations on it later. Several file system implementations for cloud environments address the administration of large quantities of data across many nodes. These file systems provide data storage support for large computing clusters, supercomputers, parallel architectures, and storage and computing clouds.
  •  File systems such as Lustre, IBM General Parallel File System (GPFS), Google File System (GFS), Sector/Sphere, and Amazon Simple Storage Service (S3) are widely used high-performance distributed file systems and storage clouds.

Lustre 

The Lustre file system is an open-source, parallel file system that meets many of the needs of high-performance computing simulation environments. Lustre grew from a research project at Carnegie Mellon University into a file system that supports some of the world's most powerful supercomputers. It provides a POSIX-compliant file system interface and is designed to scale to thousands of clients, petabytes (PB) of storage, and hundreds of gigabytes per second (GB/s) of I/O bandwidth. The core components of the Lustre file system are the Metadata Servers (MDS), Metadata Targets (MDT), Object Storage Servers (OSS), Object Storage Targets (OST), and Lustre clients.
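
Because Lustre exposes a POSIX-compliant interface, applications use ordinary file I/O, while striping across OSTs is typically controlled with the lfs client utility. The sketch below is only an illustration under stated assumptions: the mount point /mnt/lustre and the stripe settings are hypothetical examples, and the exact lfs options available depend on the installed Lustre version.

```python
import os
import subprocess

# Hypothetical Lustre mount point; the real path is set by the site admin.
LUSTRE_DIR = "/mnt/lustre/demo"
DATA_FILE = os.path.join(LUSTRE_DIR, "results.bin")

os.makedirs(LUSTRE_DIR, exist_ok=True)

# Ask Lustre to stripe new files in this directory over 4 OSTs with a
# 4 MiB stripe size. 'lfs setstripe' ships with the Lustre client tools;
# option names may vary between Lustre versions.
subprocess.run(
    ["lfs", "setstripe", "-c", "4", "-S", "4M", LUSTRE_DIR],
    check=True,
)

# Because Lustre is POSIX-compliant, ordinary file I/O works unchanged:
with open(DATA_FILE, "wb") as f:
    f.write(os.urandom(16 * 1024 * 1024))  # 16 MiB of sample data

with open(DATA_FILE, "rb") as f:
    first_block = f.read(4 * 1024 * 1024)
print(f"Read {len(first_block)} bytes back from the striped file")
```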

IBM General Parallel File System (GPFS)

The IBM General Parallel File System (GPFS) is a high-performance distributed file system developed by IBM that provides support for the RS/6000 supercomputer and Linux computing clusters. It is a cluster file system that allows numerous nodes to access a single file system, or a collection of file systems, at the same time. These nodes can be fully SAN-attached or a combination of SAN-attached and network-attached. This provides high-performance access to the shared data, which can be used to support a scale-out solution or to provide a high-availability platform. Beyond common data access, GPFS offers additional capabilities such as data replication, policy-based storage management, and multi-site operations. A GPFS cluster can be made up of AIX nodes, Linux nodes, Windows Server nodes, or a combination of the three. GPFS can run on virtualized instances in environments that use logical partitioning or other hypervisors to provide shared data access. Multiple GPFS clusters can share data within a single site or over WAN connections.
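
One practical consequence of GPFS's shared, POSIX-style access is that processes on different cluster nodes can safely update the same file. The following is a minimal sketch, assuming a hypothetical GPFS mount at /gpfs/fs1; it relies on standard POSIX advisory locking, which GPFS coordinates across cluster nodes.

```python
import fcntl
import os

# Hypothetical GPFS mount shared by all nodes in the cluster.
SHARED_LOG = "/gpfs/fs1/shared/app.log"

def append_record(text: str) -> None:
    """Append one line to a file on the shared GPFS file system.

    GPFS provides POSIX semantics across nodes, so the advisory fcntl
    lock taken here also serializes writers running on other nodes
    that open the same file.
    """
    fd = os.open(SHARED_LOG, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)        # cluster-wide exclusive lock
        os.write(fd, (text + "\n").encode())
        fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)

if __name__ == "__main__":
    append_record(f"node {os.uname().nodename} wrote this record")
```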

Google File System (GFS)

Google File System (GFS) is the storage architecture that allows distributed applications to run on Google's computing cloud. GFS is designed to run on many standard x86-based servers and is intended to be a fault-tolerant, highly available distributed file system. GFS was created with many of the same aims as previous distributed file systems, such as scalability, performance, dependability, and resilience. However, Google designed it to meet some specific goals driven by significant observations of its own workload. Google experienced regular failures of its cluster machines, so the distributed file system must be extremely fault tolerant and provide some form of automatic fault recovery. Because multi-gigabyte files are common in this environment, I/O and file block sizes must be planned accordingly. Similarly, most files are appended to rather than rewritten or modified in place, so optimization efforts should be directed at appends. Rather than being a generic implementation of a distributed file system, GFS is tailored to Google's needs in terms of distributed storage for applications.
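
GFS itself is internal to Google and has no public client API, so the sketch below does not use GFS at all; it only illustrates, with ordinary local files, the append-heavy access pattern described above: records are appended, never rewritten in place, and read back sequentially. File name and record format are made-up examples.

```python
import json

# Stand-in for one large, append-only file of the kind GFS is optimized for.
LOG_FILE = "events.log"

def append_record(record: dict) -> None:
    """Append a record; existing data is never modified in place."""
    with open(LOG_FILE, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

def scan_records():
    """Stream the whole file sequentially from start to end."""
    with open(LOG_FILE, "r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

if __name__ == "__main__":
    append_record({"event": "page_view", "url": "/index.html"})
    for record in scan_records():
        print(record)
```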

Sector/Sphere 

Sector/Sphere facilitates distributed data storage, dissemination, and processing over large clusters of commodity computers, either within a single data center or across numerous data centers. Sector is a fast, scalable, and secure distributed file system. Sphere is a high-performance parallel data processing engine that can process Sector data files on the storage nodes using very simple programming interfaces. Sector is one of the few file systems capable of spanning several data centers over wide area networks. Unlike other file systems, Sector does not split files into blocks but instead replicates entire files over several nodes, allowing users to tailor the replication strategy for better performance. The system architecture comprises four kinds of nodes: a security server, one or more master nodes, slave nodes, and client machines. The security server stores all information regarding access control rules for users and files, whereas master servers coordinate and serve client I/O requests, which ultimately interact with slave nodes to access files. Sector employs the UDT protocol to provide high-speed data transfer, and its data placement strategy enables it to function efficiently as a content distribution network over a WAN. For high reliability and availability, Sector does not require hardware RAID; instead, data is automatically replicated within Sector.
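
Sector's actual replication logic lives in its master servers and is not reproduced here; the sketch below is a purely conceptual illustration of whole-file replication (copying entire files to several slave nodes rather than splitting them into blocks), with made-up node names and an arbitrary placement rule.

```python
import hashlib
import random

# Conceptual sketch only: full-file replication across slave nodes.
# Node names and the placement rule below are invented for illustration.
SLAVE_NODES = ["slave-01", "slave-02", "slave-03", "slave-04", "slave-05"]
REPLICATION_FACTOR = 3

def place_replicas(file_name: str, nodes=SLAVE_NODES, copies=REPLICATION_FACTOR):
    """Pick which slave nodes hold full copies of one file.

    A seed derived from the file name keeps the choice stable for a
    given file, loosely mimicking a configurable placement policy.
    """
    seed = int(hashlib.sha256(file_name.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return rng.sample(nodes, k=min(copies, len(nodes)))

if __name__ == "__main__":
    print(place_replicas("datacenter-a/readings-2013.csv"))
```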

Amazon Simple Storage Service (Amazon S3)

Amazon Simple Storage Service (Amazon S3) is storage for the Internet, designed to make web-scale computing easier. Although its internal design is not disclosed, the system is said to provide high availability, dependability, scalability, security, high performance, virtually unlimited storage, and low latency at low cost. The service offers a flat storage space organized into buckets that are linked to an Amazon Web Services (AWS) account. Each bucket can hold many objects, each of which is identified by a unique key. Objects are identified by unique URLs and are accessible over HTTP, allowing for very simple get/put semantics. Because HTTP is used, no special library is required to access the storage system, and objects can also be retrieved via the BitTorrent protocol.
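
A minimal sketch of the bucket/key put-get model using the boto3 AWS SDK for Python is shown below; the bucket name and object key are placeholders, and it assumes valid AWS credentials and an existing bucket in your account.

```python
import boto3

# Assumes AWS credentials are configured (environment variables, ~/.aws, etc.)
# and that the bucket name below already exists (placeholder name).
s3 = boto3.client("s3")
BUCKET = "my-example-bucket"

# PUT: store an object under a unique key inside the bucket.
s3.put_object(Bucket=BUCKET, Key="reports/2024/summary.txt",
              Body=b"hello from S3")

# GET: retrieve the same object by bucket + key.
response = s3.get_object(Bucket=BUCKET, Key="reports/2024/summary.txt")
print(response["Body"].read().decode())

# Objects are also addressable by URL, e.g.
# https://my-example-bucket.s3.amazonaws.com/reports/2024/summary.txt
```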

Customers of all sizes and sectors can use it to store and protect any quantity of data for a variety of use cases, including data lakes, websites, mobile apps, backup and restore, archiving, business applications, IoT devices, and big data analytics. Amazon S3 offers straightforward administration capabilities that let you organize your data and establish fine-grained access restrictions to fit your specific business, organizational, and compliance needs. Amazon S3 is designed for 99.999999999% (11 nines) of durability and stores data for millions of applications used by companies all around the world.


