Explain Data Reduction and also explain techniques of data reduction.
A database or data warehouse may store terabytes of data, so data analysis and mining on such huge amounts of data can take a very long time. Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume but still contains the critical information. Data reduction therefore increases the efficiency of data mining. Techniques of data reduction include
- Dimensionality reduction
- Numerosity reduction and
- Data compression.
1. Dimensionality Reduction
Dimensionality reduction removes attributes from the data set under consideration, thereby reducing the volume of the original data.
Example: If we know the mobile number, then we can determine the mobile network, so the mobile network attribute is redundant. Dropping it reduces the table by one dimension.
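The mobile-number example can be sketched in Python as follows. This is only an illustration: the rows, the network names, and the rule that the first three digits identify the network are all hypothetical assumptions, since the original table is not shown here.

```python
# Hypothetical rows; the table from the text is not reproduced here.
rows = [
    {"name": "Asha", "mobile_number": "9801234567", "mobile_network": "NetA"},
    {"name": "Bikram", "mobile_number": "9812345678", "mobile_network": "NetB"},
    {"name": "Chandra", "mobile_number": "9809876543", "mobile_network": "NetA"},
]

# Assumed rule for illustration: the first three digits identify the network.
PREFIX_TO_NETWORK = {"980": "NetA", "981": "NetB"}


def network_of(mobile_number):
    """Derive the network from the number, so the column need not be stored."""
    return PREFIX_TO_NETWORK[mobile_number[:3]]


def drop_attribute(rows, attribute):
    """Return copies of the rows without the given attribute."""
    return [{k: v for k, v in row.items() if k != attribute} for row in rows]


# Remove the redundant dimension.
reduced = drop_attribute(rows, "mobile_network")

# The dropped attribute can still be recovered when needed.
recovered = [network_of(row["mobile_number"]) for row in reduced]

print(reduced[0])
print(recovered)
```

Because the dropped attribute can be derived from one that is kept, no information is lost by removing it.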
2. Numerosity Reduction
Numerosity reduction replaces the original data with a much smaller representation. There are two types: parametric and non-parametric numerosity reduction.
a) Parametric
Parametric numerosity reduction stores only the parameters of a model fitted to the data instead of the original data. Regression and log-linear models are two such methods. Linear regression models the relationship between two attributes by fitting a linear equation to the data set. Suppose we need to model a linear function between two attributes:
y = wx + b
Here, y is the response attribute and x is the predictor attribute. In data mining terms, x and y are numeric database attributes, whereas w and b are the regression coefficients.
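As a rough sketch of how this reduces data, the snippet below fits w and b by ordinary least squares and then keeps only those two numbers in place of the full list of (x, y) pairs. The sample values and the function names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates of w and b for y = w*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Illustrative data: five (x, y) pairs that are roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

# The reduced representation is just the two coefficients.
w, b = fit_line(xs, ys)


def predict(x):
    """Approximate y from the stored parameters instead of the raw data."""
    return w * x + b


print(w, b)
print(predict(3.0))
```

The reconstruction is approximate: the stored line reproduces the trend of the data, not each original value.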
Multiple linear regression extends this by modeling the response variable y as a linear function of two or more predictor variables. A log-linear model discovers the relationship between two or more discrete attributes in the database. Given a set of tuples in n-dimensional space, the log-linear model is used to estimate the probability of each tuple in that multidimensional space.
b) Non-parametric
Non-parametric methods for storing reduced representations of the data include histograms, clustering, and sampling. Clustering techniques group the objects in the data so that objects within a cluster are similar to each other but dissimilar to objects in other clusters.
The similarity of the objects inside a cluster can be measured with a distance function: the more similar the objects are, the closer together they lie in the cluster. The quality of a cluster depends on its diameter, i.e., the maximum distance between any two objects in the cluster. The original data is replaced by the cluster representation. This technique is most effective when the data can be organized into distinct clusters.
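The idea of replacing data with a cluster representation can be sketched as below. This is a minimal k-means-style illustration, not a prescribed algorithm: the text does not name a specific clustering method, and the points, the starting centroids, and the helper names are assumptions chosen for the example. Each cluster is summarized by its centroid, its size, and its diameter.

```python
import math


def assign(points, centroids):
    """Put each point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def mean_point(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    for _ in range(iterations):
        clusters = assign(points, centroids)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else old
                     for c, old in zip(clusters, centroids)]
    return centroids, assign(points, centroids)


def diameter(cluster):
    """Maximum distance between any two objects in the cluster."""
    return max((math.dist(p, q) for p in cluster for q in cluster),
               default=0.0)


# Illustrative 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]

centroids, clusters = k_means(points, centroids=[points[0], points[3]])

# Reduced representation: one small summary per cluster.
summary = [{"centroid": c, "size": len(members), "diameter": diameter(members)}
           for c, members in zip(centroids, clusters)]

for item in summary:
    print(item)
```

A smaller diameter means the centroid stands in for its members more faithfully, which is why well-separated, compact clusters give a better reduction.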
3. Data Compression
Data compression applies a transformation to the original data in order to obtain a compressed representation. If the original data can be reconstructed from the compressed data without losing any information, the data reduction is lossless. If the original data can only be reconstructed approximately, or not at all, the data reduction is lossy. Dimensionality and numerosity reduction methods are also used for data compression.
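The difference between lossless and lossy reduction can be illustrated with a short Python sketch. The lossless part uses the standard-library `zlib` module; the lossy part simply rounds values, which is an assumed stand-in for a real lossy scheme. The sample data is invented for the example.

```python
import zlib

# Lossless: the original bytes are recovered exactly after decompression.
original = ("reading=21.5;" * 200).encode("utf-8")
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original), len(compressed))
print(restored == original)

# Lossy: rounding keeps less detail, and the exact values cannot be recovered.
readings = [21.53, 21.48, 22.91, 23.07]
rounded = [round(r) for r in readings]

print(rounded)
```

After rounding, several distinct readings map to the same stored value, so there is no way to tell from the reduced data what the originals were.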