Explain technologies for data-intensive computing.
TECHNOLOGIES FOR DATA-INTENSIVE COMPUTING
The creation of applications that are primarily focused on processing massive amounts of data is referred to as data-intensive computing. Storage systems and programming models form a natural categorization of the technologies that enable data-intensive computing.
1) Storage Systems
- Because of the growth of unstructured data in the form of blogs, Web pages, software logs, and sensor readings, the relational model in its original formulation no longer appears to be the preferable approach for enabling large-scale data analytics. Fields such as scientific computing, enterprise applications, media entertainment, natural language processing, and social network analysis generate very large quantities of data, which are treated as a business asset as organizations realize the importance of data analytics. Such growth in the volume of data demands more efficient data management techniques.
- As discussed earlier, cloud computing provides on-demand access to enormous amounts of computing capability, enabling developers to build software systems that scale progressively to arbitrary degrees of parallelism, with applications and services running on hundreds or thousands of nodes. Traditional computing architectures are not capable of accommodating such a dynamic environment, so new data management technologies have been introduced to support data-intensive computing.
- Distributed file systems play a vital role in big data management, providing the interface to store information in the form of files and to perform read and write operations on it later. Several file system implementations in cloud environments address the administration of large quantities of data across many nodes. These file systems provide data storage support for large computing clusters, supercomputers, parallel architectures, and storage and computing clouds.
- File systems such as Lustre, IBM General Parallel File System (GPFS), Google File System (GFS), Sector/Sphere, and Amazon Simple Storage Service (S3) are widely used high-performance distributed file systems and storage clouds.
Lustre
The Lustre file system is an open-source, parallel file system that meets many of the needs of high-performance computing simulation environments. Lustre grew from a research project at Carnegie Mellon University into a file system that supports some of the world's most powerful supercomputers. It provides a POSIX-compliant file system interface and is designed to scale to thousands of clients, petabytes (PB) of storage, and hundreds of gigabytes per second (GB/s) of I/O bandwidth. The Metadata Servers (MDS), Metadata Targets (MDT), Object Storage Servers (OSS), Object Storage Targets (OST), and Lustre clients are the core components of the Lustre file system.
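Because Lustre is POSIX-compliant, ordinary file I/O works against a mounted Lustre file system. The sketch below is a minimal illustration, not from the original text: the mount point /mnt/lustre is an assumption, and the optional `lfs setstripe` call (a real Lustre CLI tool that sets a file's layout across OSTs) is shown only to hint at Lustre-specific tuning.

```python
import subprocess

# Assumed Lustre mount point; any POSIX path on the mounted file system works.
path = "/mnt/lustre/results/output.dat"

# Optionally pre-create the file striped across 4 OSTs (Lustre-specific tuning).
# check=False: if this runs outside a Lustre mount, we fall back to a plain file.
subprocess.run(["lfs", "setstripe", "-c", "4", path], check=False)

# Plain POSIX write/read: clients contact the MDS for metadata and OSSs for data.
with open(path, "wb") as f:
    f.write(b"simulation output")

with open(path, "rb") as f:
    print(f.read())
```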
IBM General Parallel File System (GPFS)
The IBM General Parallel File System (GPFS) is a high-performance distributed file system developed by IBM that supports the RS/6000 supercomputer and Linux computing clusters. It is a cluster file system that allows numerous nodes to access a single file system, or a collection of file systems, at the same time. These nodes can be fully SAN-attached or a combination of SAN- and network-attached, offering high-performance access to a shared collection of data that can back a scale-out solution or provide a high-availability platform. Beyond common data access, GPFS provides additional capabilities such as data replication, policy-based storage management, and multi-site operations. A GPFS cluster can be made up of AIX nodes, Linux nodes, Windows Server nodes, or a combination of the three, and GPFS can run on virtualized instances in environments that use logical partitioning or other hypervisors to provide shared data access. Multiple GPFS clusters can share data within a single site or across WAN links.
Google File System (GFS)
Google File System (GFS) is the storage architecture that supports the distributed applications running on Google's computing cloud. GFS is designed to run on many standard x86-based servers and is intended to be a fault-tolerant, highly available distributed file system. GFS was created with many of the same goals as previous distributed file systems, such as scalability, performance, dependability, and resilience; Google, however, also pursued specific aims motivated by significant observations of its own workload. Because Google experienced regular failures of its cluster machines, the file system must be extremely fault tolerant and provide some form of automated fault recovery. Because multi-gigabyte files are prevalent in this environment, I/O and file block sizes must be planned accordingly. Similarly, since the majority of files are appended to rather than rewritten or altered, optimization efforts are directed at append operations. Rather than being a generic implementation of a distributed file system, GFS is tailored to Google's needs for distributed storage for its applications.
Sector/Sphere
Sector/Sphere facilitates distributed data storage, dissemination, and processing over huge clusters of commodity computers, either inside a single data center or across numerous data centers. Sector is a fast, scalable, and secure distributed file system; Sphere is a high-performance parallel data processing engine that can process Sector data files on the storage nodes through very simple programming interfaces. Sector is one of the few file systems capable of spanning several data centers over wide-area networks. Unlike other file systems, Sector does not split files into blocks but instead replicates entire files over several nodes, allowing users to tailor the replication strategy for better performance. The system architecture comprises four kinds of nodes: a security server, one or more master nodes, slave nodes, and client machines. The security server stores all information regarding access-control rules for users and files, whereas the master servers coordinate and serve client I/O requests, which ultimately interact with the slave nodes to access files. Sector employs the UDT protocol for high-speed data transfer, and its data placement strategy enables it to function efficiently as a content distribution network over WANs. For high reliability and availability, Sector does not require hardware RAID; instead, data is automatically replicated within Sector.
Amazon Simple Storage Service (Amazon S3)
Amazon Simple Storage Service (Amazon S3) is storage for the Internet, designed to make web-scale computing easier. Although the internal specification is not disclosed, the system is said to offer high availability, dependability, scalability, security, high performance, virtually unlimited storage, and low latency at low cost. The service provides a flat storage space organized into buckets, which are attached to an Amazon Web Services (AWS) account. Each bucket can hold multiple objects, each identified by a unique key. Objects are addressable by unique URLs and accessible over HTTP, allowing for very simple get-put semantics. Because HTTP is used, no special library is required to access the storage system, and objects can also be retrieved via the BitTorrent protocol.
Customers of all sizes and sectors can use it to store and protect any quantity of data for a variety of use cases, including data lakes, websites, mobile apps, backup and restore, archiving, business applications, IoT devices, and big data analytics. Amazon S3 offers simple administration capabilities for organizing data and establishing fine-grained access restrictions to fit specific business, organizational, and compliance needs. Amazon S3 is designed for 99.999999999% (11 nines) of data durability and stores data for millions of applications of companies all around the world.
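The get-put semantics described above map directly onto S3's SDKs. As a minimal sketch (not from the original text), the following Python snippet uses the official boto3 SDK; the bucket name "example-bucket", the object key, and configured AWS credentials are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Put: the bucket is a flat namespace, and the key uniquely identifies the object.
s3.put_object(Bucket="example-bucket", Key="logs/2024/app.log", Body=b"hello, S3")

# Get: the same key retrieves the object over HTTP.
response = s3.get_object(Bucket="example-bucket", Key="logs/2024/app.log")
print(response["Body"].read())
```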
2) Programming Models
Apache CouchDB, MongoDB, Amazon DynamoDB, Google Bigtable, Apache Cassandra, Hadoop HBase, etc. are some notable implementations that enable data-intensive applications.
Apache CouchDB
Apache CouchDB is an open-source NoSQL document database that gathers and stores data in JSON-based document formats. In contrast to relational databases, CouchDB has a schema-free architecture, which facilitates record maintenance across a variety of computing platforms, mobile phones, and web browsers. CouchDB stores data as JSON documents, queries them with JavaScript using MapReduce, and provides an API over HTTP. In CouchDB, the fundamental unit of data is the document, which includes metadata. Document fields are individually named and can hold a variety of value types, and there is no set limit on text size or element count. CouchDB ensures ACID properties on data.
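Because CouchDB's API is plain HTTP, no special client library is needed. A minimal sketch (not from the original text) using Python's requests module follows; the server address, credentials, database name, and document content are all assumptions.

```python
import requests

BASE = "http://localhost:5984"   # CouchDB's default port; server assumed local
AUTH = ("admin", "password")     # placeholder credentials

# Create a database; schema-free, so no table or schema definition is needed.
requests.put(f"{BASE}/articles", auth=AUTH)

# Store a JSON document; CouchDB adds _id and _rev metadata automatically.
doc = {"title": "Data-intensive computing", "tags": ["nosql", "couchdb"]}
r = requests.post(f"{BASE}/articles", json=doc, auth=AUTH)
doc_id = r.json()["id"]

# Fetch the document back by its unique key, again over plain HTTP.
print(requests.get(f"{BASE}/articles/{doc_id}", auth=AUTH).json())
```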
MongoDB
MongoDB is a major open-source NoSQL document database written in C++. It is document-oriented and used for large-scale data storage. It uses JSON-like documents with optional schemas and employs collections and documents rather than the tables and rows of traditional relational databases. Documents, the fundamental unit of data, are made up of key-value pairs, while collections hold sets of documents and function as the equivalent of relational database tables. MongoDB offers sharding, the ability to divide a collection's contents over many nodes.
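As a brief illustration of the document/collection model (not from the original text), the following sketch uses the official pymongo driver; the connection string, database, and collection names are assumptions.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # a local server is assumed
db = client["shop"]
products = db["products"]        # a collection: the analogue of a table

# Documents are key-value pairs; no schema has to be declared up front.
products.insert_one({"name": "ssd", "price": 79.0, "tags": ["storage"]})

# Query by field value; the result comes back as a JSON-like document.
print(products.find_one({"name": "ssd"}))
```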
CouchDB vs. MongoDB
CouchDB and MongoDB both offer schema-less storage in which the primary objects are documents grouped into sets of key-value fields. Each field's value can be a string, integer, float, date, or array. Both databases provide a RESTful interface and store data in JSON format; both handle large documents and allow searching and indexing of the stored data using the MapReduce programming model. They also offer JavaScript, rather than SQL, as the base language for data searching and manipulation, and both provide data replication and high availability.
Amazon DynamoDB
Amazon DynamoDB is a key-value and document database that delivers single-digit-millisecond performance at any scale. It is a fully managed, multi-region, multi-active, durable database with built-in security, backup and restore, and in-memory caching for web-scale applications. DynamoDB can handle more than 10 trillion requests per day, with peaks of more than 20 million requests per second. Many of the world's fastest-growing companies, such as Lyft, Airbnb, and Redfin, as well as corporations such as Samsung, Toyota, and Capital One, rely on DynamoDB's scale and performance to serve mission-critical workloads.
Hundreds of thousands of Amazon Web Services customers have selected DynamoDB as their key-value and document database for mobile, web, gaming, ad tech, IoT, and other applications requiring low-latency data access at any scale.
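A minimal sketch of the key-value access pattern (not from the original text), using the boto3 SDK; the table name "Movies", its key schema (partition key "year", sort key "title"), and configured AWS credentials are assumptions.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Movies")   # assumed table with keys: year (N), title (S)

# Put an item: every item is addressed by its key attributes.
table.put_item(Item={"year": 2015, "title": "The Big Short", "rating": 8})

# Low-latency reads address items directly by key rather than by scanning.
resp = table.get_item(Key={"year": 2015, "title": "The Big Short"})
print(resp.get("Item"))
```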
Google Bigtable
Google Bigtable is a distributed, column-oriented data store developed by Google to manage the massive volumes of structured data behind the company's Internet search and Web services operations. It is a fully managed, scalable NoSQL database service for large analytical and operational workloads, offering up to 99.999% availability. Bigtable was created for applications that require very large scalability, and the system is designed to handle petabytes of data. The database was designed to run on clustered servers and has a simple data model that Google describes as "a sparse, distributed, persistent multi-dimensional sorted map". Data is kept in order by row key, and the map is indexed by row key, column key, and timestamp; compression algorithms help achieve high capacity. Google Bigtable is the database behind Google App Engine Datastore, Google Personalized Search, Google Earth, and Google Analytics. Google has kept the software as a proprietary, in-house technology; nonetheless, Bigtable has had a significant effect on the architecture of NoSQL databases. Google software developers disclosed Bigtable's details at a symposium in 2006.
As the published material revealed Bigtable's inner workings, several other corporations and open-source development teams created Bigtable equivalents, such as Apache HBase, Apache Cassandra, and Hypertable.
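The "sorted map" data model can be seen in the Cloud Bigtable Python client. The sketch below is illustrative only (not from the original text): the project, instance, and table IDs, the "events" column family, and the row-key layout are all assumptions.

```python
from google.cloud import bigtable

# Placeholder project/instance/table identifiers.
client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")
table = instance.table("user-events")

# Each cell is addressed by (row key, column family:qualifier, timestamp),
# matching the "sparse, distributed, persistent sorted map" description.
row = table.direct_row(b"user42#2024-01-01")
row.set_cell("events", b"page", b"/home")
row.commit()

# Reads are ordered and addressed by row key.
result = table.read_row(b"user42#2024-01-01")
print(result.cells["events"][b"page"][0].value)
```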
Apache Cassandra
Apache Cassandra is a distributed, open-source NoSQL database. It implements a partitioned wide-column storage model with eventually consistent semantics. Apache Cassandra was created at Facebook using a staged event-driven architecture to combine Amazon Dynamo's distributed storage and replication techniques with Google Bigtable's data and storage engine model. Both Dynamo and Bigtable were created to satisfy rising demands for scalable, dependable, and highly available storage systems, yet each has its shortcomings. Cassandra was designed as a best-of-both-worlds combination of the two systems to satisfy growing large-scale storage requirements, both in data footprint and in query traffic. As applications began to require full global replication and always-available, low-latency reads and writes, it became clear that a new type of database architecture was needed, since relational database solutions struggled to fulfill the new needs of global-scale applications.
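A minimal sketch of Cassandra's partitioned wide-column model (not from the original text), using the open-source cassandra-driver for Python; the contact point, keyspace, table definition, and replication settings are assumptions for a single-node demo.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # a local Cassandra node is assumed
session = cluster.connect()

# Replication factor 1 is only suitable for a single-node demonstration.
session.execute(
    "CREATE KEYSPACE IF NOT EXISTS demo "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)
# A wide row: partitioned by user_id, with one column-set per timestamp.
session.execute(
    "CREATE TABLE IF NOT EXISTS demo.events "
    "(user_id text, ts timestamp, action text, PRIMARY KEY (user_id, ts))"
)

# Writes are accepted by any replica and converge (eventually consistent).
session.execute(
    "INSERT INTO demo.events (user_id, ts, action) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("u1", "login"),
)
for row in session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("u1",)):
    print(row.user_id, row.action)
```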
HBase
HBase is a non-relational, column-oriented database management system that runs on top of the Hadoop Distributed File System (HDFS) and was inspired by Google Bigtable in its design. HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data applications. It is well suited to real-time data processing and random read/write access to enormous amounts of data. Unlike relational database systems, HBase does not support a structured query language such as SQL, since it is not a relational data store. HBase applications, like Apache MapReduce applications, are written in Java, though HBase also supports application development through Apache Avro, REST, Thrift, and other interfaces.
HBase systems are intended to scale linearly. An HBase system is made up of tables with rows and columns, similar to a traditional relational database. Each table must have a declared primary key, and all access to an HBase table must use that key. As a component of HBase, Avro supports a diverse range of primitive data types such as numbers, binary data, and strings, as well as complex types such as arrays, maps, enumerations, and records. Data can also be stored in sorted order. A short example of keyed access through HBase's Thrift interface follows.
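While HBase applications are typically written in Java, the Thrift interface mentioned above allows other languages in. To keep one language throughout this post, here is a minimal sketch (not from the original text) using the open-source happybase library; the host, table name "web_pages", and column family "cf" are assumptions.

```python
import happybase

# happybase connects to HBase through its Thrift gateway; host is assumed.
connection = happybase.Connection("localhost")
table = connection.table("web_pages")   # assumed table with column family 'cf'

# Every write and read is addressed by the table's row key.
table.put(b"example.com", {b"cf:title": b"Example", b"cf:status": b"200"})

row = table.row(b"example.com")
print(row[b"cf:title"])
```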
After learning about the file systems and databases required for data-intensive computing, let us now discuss the programming of data-intensive applications. Platforms for programming data-intensive applications provide abstractions that aid in expressing computations over big data, as well as runtime systems capable of efficiently managing such massive amounts of data. These programming platforms emphasize data processing and shift transfer management into the runtime system, making data always available whenever needed. The MapReduce programming platform follows this approach: it describes computation in two simple functions, map and reduce, and hides the difficulties of managing massive data files in the platform's distributed file system.
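To make the two functions concrete, here is a toy, single-process word count in pure Python (an illustration, not any platform's actual API). A real platform such as Hadoop runs map and reduce tasks across many nodes over files stored in its distributed file system; the shuffle step below stands in for the platform's grouping of intermediate results.

```python
from collections import defaultdict

def map_fn(line):
    # Map: emit (key, value) pairs -- here, (word, 1) for every word in a line.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # Reduce: combine all values that share the same key.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]

# "Shuffle" phase: group intermediate pairs by key (done by the runtime
# system on a real platform).
groups = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        groups[key].append(value)

print(dict(reduce_fn(k, v) for k, v in groups.items()))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```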