Posts

Showing posts with the label Advance Database

What are the main characteristics of NOSQL systems in the areas related to data models and query languages?

Characteristics related to data models and query languages:

1. Not requiring a schema: NoSQL systems allow semi-structured, self-describing data. Users can specify a partial schema in some systems to improve storage efficiency, but most NoSQL systems do not require a schema at all. Constraints on the data therefore have to be enforced in the application programs that access the data items. Common languages for describing semi-structured data are JSON (JavaScript Object Notation) and XML (Extensible Markup Language); a sample document is shown below.

2. Less powerful query languages: Many applications that use NoSQL systems do not require a powerful query language such as SQL, because search (read) queries in these systems often locate a single object in a single file by its object key. Reading and writing data objects is done by the programmer by calling the appropriate operations through an API, often summarized as SCRUD: Search, Create, Read, Update and Delete. Some systems provide a high-level query language, but it may not have the full power of SQL, so only limited types of queries can be formulated.
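To illustrate the self-describing nature of JSON data mentioned above, here is a minimal, hypothetical customer document (all field names are invented for the example). Note that no schema is declared anywhere, and another document in the same collection could legitimately have different fields:

    {
      "customer_id": "C1001",
      "name": "Asha Rai",
      "emails": ["asha@example.com"],
      "orders": [
        { "order_no": 501, "total": 74.50 },
        { "order_no": 517, "total": 12.00 }
      ]
    }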

What are the main categories of NOSQL systems? List a few of the NOSQL systems.

Types of NoSQL databases: Several varieties of NoSQL databases have been created to support specific needs and use cases, and they fall into four main categories.

Document databases: Document databases store data in documents similar to JSON (JavaScript Object Notation) objects. Each document contains a set of field-value pairs; the values can be of many types, such as strings, numbers, Booleans, arrays, or nested objects, and their structure usually aligns with the objects developers work with in code. Because of their flexible field types and powerful query languages, document databases can serve as general-purpose databases and can scale out horizontally to accommodate enormous data volumes. Examples: MongoDB, CouchDB.

Key-value databases: Key-value databases are a simpler kind of database in which each item is a key together with a value. A value can typically be retrieved only by referencing its key, so querying for a specific key-value pair is straightforward. Examples: Redis, Amazon DynamoDB.

The remaining two categories are wide-column (column-family) stores, such as Cassandra and HBase, and graph databases, such as Neo4j.

How is a vertical partitioning of a relation specified? How can a relation be put back together from a complete vertical partitioning?

Vertical partitioning splits a relation by columns: each partition is a table with fewer columns, and additional tables store the remaining columns. Normalization also splits columns across tables, but vertical partitioning goes beyond that and partitions the columns even when the relation is already normalized. The primary key is duplicated in every partition so that the original relation can be reconstructed. Given a complete vertical partitioning, the relation is put back together by joining all the partitions on the primary key.
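A minimal sketch, assuming a hypothetical EMPLOYEE(emp_id, name, salary, address) table split into two vertical partitions (the table and column names are invented for the example):

    -- Two vertical partitions; the primary key emp_id is duplicated in both
    CREATE TABLE emp_part1 (emp_id INT PRIMARY KEY, name VARCHAR(50), salary DECIMAL(10,2));
    CREATE TABLE emp_part2 (emp_id INT PRIMARY KEY, address VARCHAR(100));

    -- Reconstruct the original EMPLOYEE relation by joining on the key
    SELECT p1.emp_id, p1.name, p1.salary, p2.address
    FROM   emp_part1 p1
    JOIN   emp_part2 p2 ON p1.emp_id = p2.emp_id;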

How is a horizontal partitioning of a relation specified? How can a relation be put back together from a complete horizontal partitioning?

Horizontal partitioning splits a relation by rows: the table is divided into multiple tables that all have the same columns but each contain fewer rows. For example, a table containing 1 billion rows could be partitioned horizontally into 12 tables, with each smaller table holding one month of data for a specific year. Every partition keeps all of the columns, including the primary key, so no information is lost. Given a complete horizontal partitioning, the original relation is put back together by taking the union of all the partitions.
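A minimal sketch, assuming a hypothetical ORDERS table partitioned by month (names are invented for the example):

    -- Two horizontal partitions with identical columns but disjoint rows
    CREATE TABLE orders_2024_01 (order_id INT PRIMARY KEY, order_date DATE, amount DECIMAL(10,2));
    CREATE TABLE orders_2024_02 (order_id INT PRIMARY KEY, order_date DATE, amount DECIMAL(10,2));

    -- Reconstruct the original ORDERS relation with a union of all partitions
    SELECT order_id, order_date, amount FROM orders_2024_01
    UNION ALL
    SELECT order_id, order_date, amount FROM orders_2024_02;

Because the fragments are disjoint, UNION ALL is sufficient and avoids an unnecessary duplicate-elimination step.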

What is meant by data allocation in distributed database design? Explain the alternative strategies regarding the placement of data.

Data allocation: Each fragment, or each copy of a fragment, must be assigned to a particular site in the distributed system; the process of choosing this "optimal" distribution is called data distribution (or data allocation). The choice of sites and the degree of replication depend on the performance and availability goals of the system and on the types and frequencies of transactions submitted at each site. For example, if high availability is required, transactions can be submitted at any site, and most transactions are retrieval only, then a fully replicated database is a good choice. However, if certain transactions that access particular parts of the database are mostly submitted at a particular site, the corresponding set of fragments can be allocated at that site only. Data that is accessed at multiple sites can be replicated at those sites, but if many updates are performed, it may be useful to limit replication. Finding an optimal, or even a good, solution to distributed data allocation is a complex optimization problem.

What additional functions does a DDBMS have over a centralized DBMS?

The additional functions of a DDBMS over a centralized DBMS are as follows:

Keeping track of data distribution: the ability to keep track of the data distribution, fragmentation, and replication by expanding the DDBMS catalog.

Distributed query processing: the ability to access remote sites and transmit queries and data among the various sites via a communication network.

Distributed transaction management: the ability to devise execution strategies for queries and transactions that access data from more than one site, to synchronize access to distributed data, and to maintain the integrity of the overall database.

Replicated data management: the ability to decide which copy of a replicated data item to access and to maintain the consistency of the copies of replicated data items.

Distributed database recovery: the ability to recover from individual site crashes and from new types of failures, such as the failure of communication links.

Security: distributed transactions must be executed with proper management of the security of the data and of the authorization/access privileges of users.

Explain the difference between distributed databases and client/server architecture.

Client/server: The client/server architecture was developed to deal with computing environments in which a large number of computers and servers are connected via a network. In this architecture, a client is a user machine that provides the user interface and local processing capabilities. When a client requires additional functionality, such as database access, it connects to a server that can provide that functionality. A server, in short, is a machine that provides services to the client (user) machines.

Distributed DBMS: In a distributed DBMS, the data itself is distributed over geographically separate sites. Each site is a complete database system in its own right, but the different sites work together, so that a user can access data anywhere in the network as if it were stored at the user's own site.

The essential difference is therefore where the data resides: in a client/server system the database is held by the server(s) and the clients simply request services from it, whereas in a distributed database the data is fragmented and/or replicated across multiple cooperating sites, each running its own DBMS.

Define and explain the different types/dimensions of distribution transparency.

Distribution transparency: Distribution transparency is the property of distributed databases by virtue of which the internal details of the distribution are hidden from the users. The DDBMS designer may choose to fragment tables, replicate the fragments, and store them at different sites; because users are unaware of these details, they find the distributed database as easy to use as any centralized database. The three dimensions of distribution transparency are:

Location transparency
Fragmentation transparency
Replication transparency

Location transparency ensures that a user can query any table, or any fragment of a table, as if it were stored locally at the user's own site. The fact that the table or its fragments are stored at a remote site should be completely invisible to the end user; the addresses of the remote sites and the access mechanisms are completely hidden. To provide location transparency, the DDBMS needs access to an up-to-date and accurate data dictionary/directory that records where the data is located. Fragmentation transparency likewise hides from the user the fact that a table has been divided into fragments, and replication transparency hides the existence of multiple copies of the data.
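As an illustrative sketch (not a prescribed mechanism), one simple way fragmentation and location can be hidden from users is by defining a view over the fragments; the table, column, and synonym names below are hypothetical:

    -- Hypothetical horizontal fragments of EMPLOYEE stored at two sites,
    -- exposed locally under the synonyms emp_site1 and emp_site2
    CREATE VIEW employee AS
        SELECT emp_id, name, dept_no FROM emp_site1
        UNION ALL
        SELECT emp_id, name, dept_no FROM emp_site2;

    -- Users query EMPLOYEE as if it were a single local table
    SELECT name FROM employee WHERE dept_no = 10;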

What are the components of a DDBMS?

Components of a distributed database system: The different components of a DDBS are as follows:

Computer workstations or remote devices (sites or nodes) that form the network system. The distributed database system must be independent of the computer system hardware.

Network hardware and software components that reside in each workstation or device. The network components allow all sites to interact and exchange data. Because the components (computers, operating systems, network hardware, and so on) are likely to be supplied by different vendors, it is best to ensure that the distributed database functions can be run on multiple platforms.

Communications media that carry the data from one node to another. The DDBMS must be communications-media-independent; that is, it must be able to support several types of communications media.

The transaction processor (TP), the software component found in each computer or device that requests data. The transaction processor receives and processes the application's data requests (both remote and local); it is also known as the application processor (AP) or the transaction manager (TM).

The data processor (DP), the software component residing on each computer or device that stores and retrieves the data located at that site; it is also known as the data manager (DM).

Explain the difference between a distributed database and distributed processing.

In distributed processing, a database's logical processing is shared among two or more physically independent sites that are connected through a network. For example, the data input/output (I/O), data selection, and data validation might be performed on one computer, and a report based on that data might be created on another computer. A distributed database, on the other hand, stores a logically related database over two or more physically independent sites connected via a computer network; a distributed processing system, in contrast, may use only a single-site database while sharing the processing chores among several sites. In a distributed database system, the database is composed of several parts known as database fragments. The fragments are located at different sites and can be replicated among various sites, and each fragment is, in turn, managed by its local database process. Distributed processing does not require a distributed database, but a distributed database requires distributed processing.

What are the advantages and disadvantages of the DDBMS?

Advantages of a DDBMS:

1. Reflects organizational structure: Many organizations are naturally distributed over several locations.

2. Improved shareability and local autonomy: The geographical distribution of an organization can be reflected in the distribution of the data; users at one site can access data stored at other sites. Data can be placed at the site closest to the users who normally use it, so those users have local control of the data and can establish and enforce local policies regarding its use. A global DBA is responsible for the entire system, but part of this responsibility is generally devolved to the local level, so that the local DBA can manage the local DBMS.

3. Improved availability: In a centralized DBMS, a computer failure terminates the operation of the DBMS. In a DDBMS, however, a failure at one site, or a failure of a communication link that makes some sites inaccessible, does not make the entire system inoperable.

The main disadvantages of a DDBMS are increased complexity and cost, more difficult security and integrity control, a more complex database design, and a relative lack of standards and experience.

Describe the evolution from centralized DBMSs to distributed DBMSs.

A major motivation behind the development of database systems was the desire to integrate the operational data of an organization and to provide controlled access to that data. Although integration and controlled access may imply centralization, this is not the intention. In fact, the development of computer networks promotes a decentralized mode of work. This decentralized approach mirrors the organizational structure of many companies, which are logically distributed into divisions, departments, and projects, and physically distributed into offices, plants, and factories, each of which maintains its own operational data. The shareability of the data and the efficiency of data access are improved by a distributed database system that reflects this organizational structure, makes the data in all units accessible, and stores the data close to the location where it is most frequently used. Distributed DBMSs should therefore help resolve the "islands of information" problem, in which separate, isolated databases are difficult to access and to integrate.

How are outer join and non-equi join implemented?

Non-equi join: A non-equi join is a type of inner join in which data is retrieved from multiple tables by matching column values with an inequality condition, using operators such as <, >, <=, >=, != or BETWEEN rather than equality. The value of the join column in each row of the source table is compared with the corresponding value in the target table; whenever the comparison evaluates to true, the combined row is returned. Non-equi joins are typically used for: retrieving data that matches a range of values; checking for duplicate data between tables; and calculating totals. Syntax:

    SELECT *
    FROM   TableName1, TableName2
    WHERE  TableName1.columnName [ > | < | >= | <= | != | BETWEEN ] TableName2.columnName;

Outer join: An SQL OUTER JOIN returns all rows that satisfy the join condition and, in addition, the unmatched rows from one table (LEFT or RIGHT outer join) or from both tables (FULL outer join), with the columns of the missing side filled in with NULLs. Outer joins are usually implemented by extending one of the inner-join algorithms (for example, nested-loop or sort-merge join) so that unmatched tuples are also emitted, padded with NULLs.
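A small illustration of both joins, assuming EMP, DEPT and SALGRADE tables in the spirit of the classic SCOTT schema (column names as in that schema):

    -- Non-equi join: match each employee to the salary grade whose range contains the salary
    SELECT e.ename, e.sal, g.grade
    FROM   emp e, salgrade g
    WHERE  e.sal BETWEEN g.losal AND g.hisal;

    -- Left outer join: list every department, with NULLs for departments that have no employees
    SELECT d.deptno, d.dname, e.ename
    FROM   dept d
    LEFT OUTER JOIN emp e ON e.deptno = d.deptno;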

Discuss semi-join and anti-join as operations to which nested queries may be mapped; provide an example of each.

Semi-join: A semi-join between two tables returns the rows from the first table for which one or more matches are found in the second table. The difference between a semi-join and a conventional join is that each row of the first table is returned at most once: even if the second table contains two matches for a row in the first table, only one copy of that row is returned. Semi-joins are written using the EXISTS or IN constructs. Suppose you have the DEPT and EMP tables in the SCOTT schema and you want a list of departments with at least one employee. You could write the query with a conventional join:

    SELECT   D.deptno, D.dname
    FROM     dept D, emp E
    WHERE    E.deptno = D.deptno
    ORDER BY D.deptno;

Unfortunately, if a department has 400 employees, that department will appear in the query output 400 times. You could eliminate the duplicate rows with the DISTINCT keyword, but that makes the database do more work than necessary; what you really want here is a semi-join.

Anti-join: An anti-join is the complement of a semi-join: it returns the rows of the first table for which no match at all exists in the second table, and it is typically written with NOT EXISTS or NOT IN. Nested (correlated) subqueries of these forms can therefore be mapped by the optimizer to semi-join and anti-join operations. An example of each is sketched below.
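A sketch of both forms against the same DEPT and EMP tables; the optimizer is free to execute these as semi-join and anti-join operations:

    -- Semi-join: departments that have at least one employee (each department listed once)
    SELECT d.deptno, d.dname
    FROM   dept d
    WHERE  EXISTS (SELECT 1 FROM emp e WHERE e.deptno = d.deptno)
    ORDER BY d.deptno;

    -- Anti-join: departments that have no employees at all
    SELECT d.deptno, d.dname
    FROM   dept d
    WHERE  NOT EXISTS (SELECT 1 FROM emp e WHERE e.deptno = d.deptno)
    ORDER BY d.deptno;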

Contrast cost estimation and heuristic rules with regard to query optimization.

The contrast between cost-based optimization and heuristic rules with regard to query optimization is as follows:

Cost-based optimization: The process of selecting the lowest-cost execution strategy is known as cost-based optimization. The optimizer enumerates alternative plans for a query (different access paths based on indexes, constraints, sorting methods, and so on) and estimates the cost of each, mainly using statistics such as record size, number of records, number of records per block, number of blocks, table size, whether the whole table fits in a block, the organization of tables, and the uniqueness and size of column values. The aim is to choose the most efficient way of implementing the query, that is, the plan with the lowest estimated cost. The cost of executing a query includes components such as disk access (I/O) cost, CPU cost, and, in a distributed setting, communication cost.

Heuristic-rule optimization: In contrast, heuristic optimization transforms the initial query expression using rules that usually, but not always, reduce cost, without estimating any costs numerically. Typical heuristics are to perform selection and projection operations as early as possible (to shrink intermediate results) and to perform the most restrictive joins first. Heuristic optimization is cheaper to carry out than exhaustive cost-based enumeration, but it may miss the truly optimal plan because it never compares the estimated costs of alternative plans; a small example is given below.
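As a small example of a heuristic rule (pushing a selection below a join), consider a hypothetical query over EMPLOYEE and DEPARTMENT tables (names invented for the example):

    SELECT *
    FROM   employee e
    JOIN   department d ON e.dno = d.dnumber
    WHERE  e.dno = 5;

A heuristic optimizer applies the restriction e.dno = 5 to EMPLOYEE before performing the join, so that far fewer tuples take part in the join, and it does so without estimating any costs. A cost-based optimizer would instead estimate the cost of the alternative plans (for example, using an index on dno versus scanning and filtering first) from the catalog statistics and pick the cheapest one.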

Explain the pipelining approach to evaluation of expressions in detail.

Pipelining: In this method, the DBMS does not store intermediate results in temporary tables. Instead, each operation passes its output tuples directly to the next operation for processing: pipelining evaluates multiple operations simultaneously by streaming the results of one operation to the next without storing the tuples on disk. In the example of figure 3.6, all three operations can be placed in a pipeline, which passes the results of the selection to the join as they are generated and, in turn, passes the results of the join to the projection as they are generated. The memory requirements are low, since the results of an operation are not stored for long; on the other hand, the inputs to an operation are not all available at once, which restricts the choice of algorithms that can be used. Creating a pipeline of operations provides two benefits: it eliminates the cost of reading and writing temporary relations, which reduces the overall query-evaluation cost; and it can start producing the first result tuples early, before the whole input has been processed, which is useful when the results are displayed or consumed as they are generated.

Explain how materialization evaluation works, with an example.

Materialization: It is easiest to understand intuitively how to evaluate an expression by looking at a pictorial representation of the expression as an operator tree. In materialized evaluation, the expression is evaluated one relational operation at a time, in an appropriate order, and the output of each operation is materialized (stored) in a temporary relation for subsequent use. In the example of figure 3.6, evaluation starts at the lowest-level operation: the selection is computed first and its result is stored in a temporary relation; that temporary relation is then used as one of the inputs to the join, whose result is materialized in another temporary relation. By repeating the process, we eventually evaluate the operation at the root of the tree, giving the final result of the expression; in this example, we get the final result by executing the projection at the root of the tree, using as input the temporary relation created by the join. A query of this shape is sketched below.
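Figure 3.6 itself is not reproduced in this post, but a query of the same shape (a selection on one table, joined to a second table, with a projection at the root) would look something like the following; the table and column names are hypothetical:

    SELECT s.name
    FROM   department d
    JOIN   staff s ON s.dept_no = d.dept_no
    WHERE  d.building = 'Watson';

Under materialized evaluation, the selection on department is computed first and written to a temporary relation T1, the join of T1 with staff is written to a temporary relation T2, and the projection of T2 onto name produces the final result; each temporary relation is written to disk and then read back by its parent operation.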

What do you mean by evaluation of an expression?

Evaluation of expressions: So far we have studied how individual relational operations are carried out. Now we consider how to evaluate an expression containing multiple operations. The obvious way is simply to evaluate one operation at a time, in an appropriate order, and there are two approaches to how a query evaluation tree can be processed:

Materialization: compute the result of each evaluation primitive and materialize (store) the resulting relation on disk, to be read by the parent operation.

Pipelining: pass tuples on to the parent operation as they are produced, even while the operation is still being executed, without storing intermediate relations on disk.

What are the main strategies for implementing the Join operation?

Like selection, the join operation can be implemented in a variety of ways. In terms of disk accesses, joins can be very expensive, so implementing and choosing efficient join algorithms is critical in minimizing a query's execution time. Five well-known strategies for implementing the join operation are:

Nested-loop join
Block nested-loop join
Indexed nested-loop join
Sort-merge join
Hash join

Nested-loop join: This algorithm consists of an inner for loop nested within an outer for loop. To describe it, we use the following notation:

r, s    the two relations being joined
tr      a tuple (record) of relation r
ts      a tuple (record) of relation s
nr      number of records in relation r
ns      number of records in relation s
br      number of blocks holding records of relation r
bs      number of blocks holding records of relation s

Here is a sample pseudo-code listing for joining the two relations r and s on a join condition θ:

    for each tuple tr in r do
        for each tuple ts in s do
            if (tr, ts) satisfies the join condition θ then
                add the combined tuple tr · ts to the result
            end if
        end for
    end for
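As a rough worked comparison, assuming the standard textbook cost model in which only block transfers are counted and neither relation fits in memory: the plain nested-loop join scans all of s once for every tuple of r, for about nr × bs + br block transfers in the worst case, while the block nested-loop join pairs whole blocks rather than individual tuples, reducing this to about br × bs + br. With hypothetical values br = 100, nr = 10,000 and bs = 400, that is roughly 10,000 × 400 + 100 = 4,000,100 block transfers for the plain algorithm versus 100 × 400 + 100 = 40,100 for the block nested-loop variant, which is why the block-based and index-based variants are preferred when no better algorithm applies.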

Explain the external sort-merge algorithm with a suitable example.

External sort-merge algorithm: Sorting a relation that does not fit in memory is called external sorting, and the most commonly used technique for it is the external sort-merge algorithm. Let M denote the number of memory pages available.

1. Create sorted runs. Initialize i = 0 and repeat the following until the end of the relation is reached (let the final value of i be N):
   (a) Read M blocks of the relation into memory.
   (b) Sort the in-memory blocks.
   (c) Write the sorted data out as run Ri.
   (d) Set i = i + 1.

2. Merge the runs (N-way merge), assuming for now that N < M:
   (a) Use N blocks of memory to buffer the input runs (one block per run) and one block to buffer the output.
   (b) Repeat the following steps until all input buffer pages are empty:
       i. Select the first record (in sort order) among all the input buffer pages.
       ii. Write the record to the output buffer; if the output buffer is full, write it to disk.
       iii. Delete the record from its input buffer page; if that buffer page becomes empty, read the next block (if any) of the corresponding run into the buffer.
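As a small worked example, assume a hypothetical relation occupying 800 blocks and M = 100 memory pages. The run-creation phase reads and writes the relation once and produces ⌈800 / 100⌉ = 8 sorted runs of 100 blocks each; since N = 8 is less than M, a single N-way merge pass suffices, reading the 800 blocks once more. Ignoring the cost of writing the final result, this gives about 800 + 800 + 800 = 2,400 block transfers, which matches the standard cost estimate of br × (2 × ⌈log base (M−1) of (br / M)⌉ + 1) with br = 800.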