Explain Data Reduction and also explain techniques of data reduction.

A database or data warehouse may store terabytes of data, so performing data analysis and mining on such huge amounts of data may take a very long time. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume but still contains the critical information. Data reduction therefore increases the efficiency of data mining. In the following sections, we discuss the techniques of data reduction. Techniques of data reduction include

  • Dimensionality reduction
  • Numerosity reduction and
  • Data compression.

1. Dimensionality Reduction

Dimensionality reduction eliminates attributes from the data set under consideration, thereby reducing the volume of the original data. A simple form of dimensionality reduction is removing a redundant attribute, i.e., one whose value can be derived from another attribute, as in the following example.



For example, consider a table that contains both a mobile number attribute and a mobile network attribute. If we know the mobile number, we can determine the mobile network, so the mobile network attribute can be dropped to reduce the table by one dimension.
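A minimal sketch of this idea in Python using pandas (the column names and rows are made up for illustration): the mobile_network attribute is dropped because it can be derived from mobile_number.

```python
import pandas as pd

# Hypothetical subscriber table: mobile_network is fully determined by
# mobile_number, so keeping both attributes adds a redundant dimension.
df = pd.DataFrame({
    "name": ["Asha", "Bikash", "Chandra"],
    "mobile_number": ["9841-111111", "9801-222222", "9841-333333"],
    "mobile_network": ["NTC", "Ncell", "NTC"],
})

# Attribute subset selection: drop the derivable attribute to reduce one dimension.
reduced = df.drop(columns=["mobile_network"])
print(reduced)
```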



2. Numerosity Reduction

Numerosity reduction reduces the volume of the original data by representing it in a much smaller form. This technique has two types: parametric and non-parametric numerosity reduction.


a) Parametric

Parametric numerosity reduction involves storing only the model parameters instead of the original data. Regression and log-linear models are common parametric methods. Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:

y = wx + b

Here, y is the response attribute and x is the predictor attribute. In data mining terms, x and y are numeric database attributes, whereas w and b are the regression coefficients.

Multiple linear regression allows the response variable y to be modeled as a linear function of two or more predictor variables. The log-linear model discovers the relationship between two or more discrete attributes in the database. Suppose we have a set of tuples presented in an n-dimensional space; the log-linear model can then be used to estimate the probability of each tuple in that multidimensional space.
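A minimal sketch of the parametric idea, using NumPy with made-up values: after fitting, only the coefficients w and b need to be stored rather than the original tuples.

```python
import numpy as np

# Hypothetical numeric attributes: x (predictor) and y (response).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# Fit y = w*x + b by least squares; for degree 1, np.polyfit returns [w, b].
w, b = np.polyfit(x, y, 1)

# Parametric numerosity reduction: discard the raw tuples, keep only w and b.
print(f"w = {w:.3f}, b = {b:.3f}")
print("estimated y for x = 6:", w * 6 + b)
```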


b) Non-parametric

Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling. Clustering techniques group similar objects from the data so that the objects within a cluster are similar to each other but dissimilar to objects in other clusters.

How similar the objects inside a cluster are can be measured using a distance function: the more similar two objects are, the closer they appear within the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster. The original data is then replaced by the cluster representation. This technique is more effective when the data can be organized into distinct clusters.
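A minimal sketch of clustering-based reduction, assuming scikit-learn is available and using made-up points: the original data is replaced by the cluster centroids.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data points that fall into two natural groups.
data = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
                 [8.0, 8.1], [7.9, 8.3], [8.2, 7.9]])

# Group similar points, then keep only the cluster representatives (centroids).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("centroids (reduced representation):\n", kmeans.cluster_centers_)
print("cluster of each original point:", kmeans.labels_)
```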


3. Data Compression 

Data compression applies a data transformation to the original data in order to obtain a compressed representation. If the original data can be reconstructed from the compressed data without losing any information, the reduction is lossless. If the original data cannot be reconstructed exactly from the compressed data, the reduction is lossy. Dimensionality and numerosity reduction methods can also be viewed as forms of data compression.
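A minimal sketch of lossless compression with Python's standard zlib module (the sample data is arbitrary): the original bytes are reconstructed exactly from the compressed form.

```python
import zlib

original = b"data reduction " * 100          # highly repetitive sample data
compressed = zlib.compress(original)          # lossless transformation
restored = zlib.decompress(compressed)        # exact reconstruction

print("original size:  ", len(original))
print("compressed size:", len(compressed))
print("lossless?", restored == original)      # True: no information lost
```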
