What Are the Data Reduction Strategies?
Data Reduction Strategies
Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. In this section, we first present an overview of data reduction strategies, followed by a closer look at individual techniques.
Overview of Data Reduction Strategies
Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.
Dimensionality reduction
Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space. Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
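As a rough illustration of attribute subset selection, the sketch below drops any numeric attribute that is almost perfectly correlated with one already kept. This is only one simple heuristic among many, not a standard algorithm from the text; the function names, the sample data, and the 0.95 threshold are all assumptions made for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no linear relationship to report
    return cov / math.sqrt(var_x * var_y)

def drop_redundant_attributes(columns, threshold=0.95):
    """Keep an attribute only if it is not highly correlated with one already kept.

    `columns` maps attribute name -> list of numeric values.
    `threshold` is a hypothetical cutoff chosen for illustration.
    """
    kept = {}
    for name, values in columns.items():
        redundant = any(
            abs(pearson(values, other)) >= threshold for other in kept.values()
        )
        if not redundant:
            kept[name] = values
    return kept

# Made-up sample data: height is recorded twice in different units.
data = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34.0, 21.0, 45.0, 29.0, 52.0],
}
reduced = drop_redundant_attributes(data)
print(list(reduced))
```

Here "height_in" carries the same information as "height_cm", so the filter removes it and keeps the other two attributes. Methods such as principal components analysis go further by projecting the data onto new axes rather than simply discarding columns.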
Numerosity reduction
Numerosity reduction techniques replace the original data volume with alternative, smaller forms of data representation. These techniques may be parametric or nonparametric. For parametric methods, a model is used to estimate the data, so that typically only the model parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples. Nonparametric methods for storing reduced representations of the data include histograms, clustering, sampling, and data cube aggregation.
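The two families can be contrasted with a small sketch. The parametric part fits a straight line by ordinary least squares and keeps only the slope and intercept; the nonparametric part summarizes values as an equal-width histogram. The function names and the sample values are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def equal_width_histogram(values, num_buckets):
    """Summarize values as (lowest value, bucket width, list of bucket counts)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    if width == 0:
        counts[0] = len(values)  # all values identical: one bucket holds them all
        return lo, width, counts
    for v in values:
        # The maximum value would land one past the end, so clamp it.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts

# Parametric: ten (x, y) pairs reduced to two stored parameters.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]
slope, intercept = fit_line(xs, ys)
estimated = [slope * x + intercept for x in xs]

# Nonparametric: fifteen made-up prices reduced to three bucket counts.
prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 30]
lo, width, counts = equal_width_histogram(prices, 3)
print(slope, intercept, counts)
```

In the parametric case the original values are estimated from the stored parameters; in the histogram case only the bucket boundaries and counts survive, so individual values can no longer be recovered.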
Data compression
In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.
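The lossless versus lossy distinction can be seen in a minimal sketch. The lossless half uses Python's standard `zlib` module; the lossy half simply rounds numeric readings, which stands in for a real lossy scheme only for the sake of illustration. The sample records and readings are made up.

```python
import zlib

# Lossless: the round trip restores the original bytes exactly.
original = ("2024-01-01,store_a,100\n" * 500).encode("utf-8")
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)

# Lossy (illustrative stand-in): rounding keeps only an approximation.
readings = [20.1234, 20.1311, 19.9876]
approx = [round(r, 1) for r in readings]
print(approx)
```

Note that the compressed bytes cannot be searched or aggregated directly; they have to be decompressed first, which is one sense in which such compression allows only limited data manipulation.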
There are many other ways of organizing methods of data reduction. Whichever method is used, the computational time spent on data reduction should not outweigh or "erase" the time saved by mining on the reduced data set.
Alternatively, data reduction strategies can be summarized as follows:
Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube.
Dimensionality Reduction: Irrelevant, weakly relevant, or redundant attributes are detected and removed, which reduces the data set size.
Data Compression: Encoding mechanisms are used to reduce the data set size.
Numerosity Reduction: The data are replaced or estimated by alternative, smaller data representations, such as parametric models or nonparametric methods (histograms, clustering, sampling).
Discretization and concept hierarchy generation: Raw data values for attributes are replaced by ranges or higher conceptual levels.
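Two items from the list above, data cube aggregation and discretization, lend themselves to a short sketch. The first function rolls quarterly sales up to annual totals, as a data cube would along the time dimension; the second replaces raw ages with coarse ranges. The sales figures, the age cutoffs, and the range labels are all hypothetical.

```python
from collections import defaultdict

def aggregate_by_year(rows):
    """Roll (year, quarter, amount) rows up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

def discretize_age(age):
    """Map a raw age to a higher-level label (cutoffs are illustrative only)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"

# Made-up quarterly sales: eight rows reduce to two annual totals.
quarterly_sales = [
    (2023, "Q1", 200), (2023, "Q2", 300), (2023, "Q3", 250), (2023, "Q4", 400),
    (2024, "Q1", 220), (2024, "Q2", 310), (2024, "Q3", 270), (2024, "Q4", 450),
]
annual_sales = aggregate_by_year(quarterly_sales)

ages = [23, 35, 61, 47, 19]
age_ranges = [discretize_age(a) for a in ages]
print(annual_sales, age_ranges)
```

Both steps trade detail for size: the quarterly breakdown and the exact ages are gone, but analyses that only need annual totals or age groups can run on far fewer distinct values.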