What are the Data Reduction Strategies ?

 Data Reduction Strategies 

 Data reduction techniques can be applied to obtain a reduced representation of the data set that is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining on the reduced data set should be more efficient yet produce the same (or almost the same) analytical results. In this section, we first present an overview of data reduction strategies, followed by a closer look at individual techniques

Overview of Data Reduction Strategies

 Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression.

Dimensionality reduction

Dimensionality reduction is the process of reducing the number of random variables or attributes under consideration. Dimensionality reduction methods include wavelet transforms and principal components analysis, which transform or project the original data onto a smaller space. Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.

Numerosity reduction

Numerosity reduction techniques replace the original data volume with alternative, smaller forms of data representation. These techniques may be parametric or non-parametric. For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored, instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples. Sina Nonparametric methods for storing reduced representations of the data include his programs, clustering, sampling, and data cube aggregation.

Data compression,

In data compression, transformations are applied so as to obtain a reduced or "compressed" representation of the original data. If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. If instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy. There are several lossless algorithms for string compression; however, they typically allow only limited data manipulation. Dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression.

There are many other ways of organizing methods of data reduction. The computational time spent on data reduction should not outweigh or "erase" the time saved by mining on a reduced data set size.

                                 OR,

Data Cube Aggregation: Aggregation operations are applied to the data in the construction of a data cube.

Dimensionality Reduction: In dimensionality reduction, redundant attributes are detected and removed which reduce the data set size

Data Compression: Encoding mechanisms are used to reduce the data set size.

Numerosity Reduction: In numerosity reduction where the data are replaced or estimated by alternative.

Discretization and concept hierarchy generation: Where raw data values for attributes are replaced by ranges or higher conceptual levels.

Comments

Popular posts from this blog

Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. a) Draw a snowflake schema diagram for the data warehouse. b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each BigUniversity student. c) If each dimension has five levels (including all), such as “student < major < status < university < all”, how many cuboids will this cube contain (including the base and apex cuboids)?

Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where a charge is the fee that a doctor charges a patient for a visit. a) Draw a schema diagram for the above data warehouse using one of the schemas. [star, snowflake, fact constellation] b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? c) To obtain the same list, write an SQL query assuming the data are stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge)

Discuss classification or taxonomy of virtualization at different levels.