Explain the general strategies for cube computation. What are the general optimization techniques for the efficient computation of data cubes?
With the different kinds of cubes described above, we can expect that there are a good number of methods for their efficient computation. In general, there are two basic data structures used for storing cuboids: relational tables are used as the basic data structure for the implementation of relational OLAP (ROLAP), while multidimensional arrays are used as the basic data structure for multidimensional OLAP (MOLAP). Although ROLAP and MOLAP may each explore different cube computation techniques, some optimization "tricks" can be shared among the different data representations. The following are general optimization techniques for the efficient computation of data cubes.
1. Sorting, hashing, and grouping: Sorting, hashing, and grouping operations should be applied to the dimension attributes in order to reorder and cluster related tuples. In cube computation, aggregation is performed on the tuples (or cells) that share the same set of dimension values. Thus, it is important to explore sorting, hashing, and grouping operations to access and group such data together and thereby facilitate the computation of such aggregates.
For example, to compute total sales by branch, day, and product, it is more efficient to sort tuples or cells by branch, then by day, and then group them according to the product name. Such implementations can be extended to data cube computation (a sketch follows the list below).
This technique can also be further extended to perform:
Shared-sorts: Sharing sorting costs across multiple cuboids when sort-based methods are used.
Shared-partitions: Sharing the partitioning cost across multiple cuboids when hash-based algorithms are used.
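To make the idea concrete, here is a minimal Python sketch of shared-sorts. The tuple layout (branch, day, product, sales) and the sample data are assumptions for illustration; a single sort on (branch, day) is reused by two cuboids whose group-by keys are prefixes of that sort order.

```python
# A minimal sketch of sort-based grouping for cube aggregation.
# The tuple layout (branch, day, product, sales) and the sample
# data are illustrative assumptions, not from the text.
from itertools import groupby
from operator import itemgetter

tuples = [
    ("B1", "Mon", "milk", 10),
    ("B1", "Mon", "bread", 5),
    ("B1", "Tue", "milk", 7),
    ("B2", "Mon", "milk", 3),
]

# One sort on (branch, day) is shared by every cuboid whose
# dimensions form a prefix of that order (shared-sorts).
tuples.sort(key=itemgetter(0, 1))

# Cuboid 1: total sales by (branch, day).
by_branch_day = {
    key: sum(t[3] for t in group)
    for key, group in groupby(tuples, key=itemgetter(0, 1))
}

# Cuboid 2: total sales by branch, reusing the same sorted order.
by_branch = {
    key: sum(t[3] for t in group)
    for key, group in groupby(tuples, key=itemgetter(0))
}

print(by_branch_day)  # {('B1', 'Mon'): 15, ('B1', 'Tue'): 7, ('B2', 'Mon'): 3}
print(by_branch)      # {'B1': 22, 'B2': 3}
```

Hash-based shared-partitions would work analogously: partition the tuples once on an attribute and reuse those partitions for every cuboid that groups on it.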
2. Simultaneous aggregation and caching intermediate results: In cube computation, it is efficient to compute higher-level aggregates from previously computed lower-level aggregates, rather than from the base fact table. Moreover, simultaneous aggregation from cached intermediate computation results may lead to a reduction in expensive disk I/O operations.
For example, to compute sales by branch, we can use the intermediate results derived from the computation of a lower-level cuboid, such as sales by branch and day. This technique can be further extended to perform amortized scans, i.e., computing as many cuboids as possible at the same time to reduce disk reads.
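As a rough illustration, the following sketch rolls sales by branch up from a cached (branch, day) cuboid instead of rescanning the base fact table; the cuboid contents are hypothetical sample values.

```python
# A minimal sketch of rolling up from a cached lower-level cuboid
# rather than rescanning the base fact table. The cell values are
# hypothetical sample data.
from collections import defaultdict

# Cached intermediate result: sales by (branch, day), assumed to
# have been computed earlier and held in memory.
sales_by_branch_day = {
    ("B1", "Mon"): 15,
    ("B1", "Tue"): 7,
    ("B2", "Mon"): 3,
}

# Higher-level cuboid: sales by branch, aggregated from the cached
# cuboid, which has far fewer cells than the base table.
sales_by_branch = defaultdict(int)
for (branch, _day), total in sales_by_branch_day.items():
    sales_by_branch[branch] += total

print(dict(sales_by_branch))  # {'B1': 22, 'B2': 3}
```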
3. Aggregation from the smallest child, when there exist multiple child cuboids: When multiple child cuboids exist, it is usually more efficient to compute the desired parent (i.e., more generalized) cuboid from the smallest previously computed child cuboid.
For example, to compute a sales cuboid C_{branch} when there exist two previously computed cuboids, C_{branch, year} and C_{branch, item}, it is obviously more efficient to compute C_{branch} from the former than from the latter if there are many more distinct items than distinct years.
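The sketch below, using made-up cell counts, picks the smaller of two hypothetical child cuboids (fewer cells to scan) before rolling up to C_{branch}.

```python
# A minimal sketch of choosing the smallest previously computed
# child cuboid for the roll-up to C_{branch}. Cell counts mirror
# the example: far fewer distinct years than distinct items.
from collections import defaultdict

c_branch_year = {("B1", 2023): 12, ("B1", 2024): 10, ("B2", 2023): 3}
c_branch_item = {("B1", f"item{i}"): 1 for i in range(22)}
c_branch_item[("B2", "item0")] = 3

# Pick the child cuboid with the fewest cells; rolling up from it
# touches the least data.
children = [c_branch_year, c_branch_item]
smallest = min(children, key=len)   # here: c_branch_year (3 cells vs. 23)

c_branch = defaultdict(int)
for (branch, _), total in smallest.items():
    c_branch[branch] += total

print(dict(c_branch))  # {'B1': 22, 'B2': 3}
```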
4. The Apriori pruning method can be explored to compute iceberg cubes efficiently: The Apriori property, in the context of data cubes, states the following: If a given cell does not satisfy minimum support, then no descendant (i.e., more specialized or detailed version) of the cell will satisfy minimum support either. This property can be used to substantially reduce the computation of iceberg cubes.
Recall that the specification of iceberg cubes contains an iceberg condition, which is a constraint on the cells to be materialized. A common iceberg condition is that the cells must satisfy a minimum support threshold such as a minimum count or sum. In this situation, the Apriori property can be used to prune away the exploration of the cell's descendants.
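The following is a simplified, BUC-style sketch (not any textbook algorithm verbatim) of how the Apriori property prunes iceberg cube computation: once a cell fails the minimum count, none of its more specialized descendants are explored. The dimension names, data, and threshold are illustrative assumptions.

```python
# A simplified sketch of iceberg cube computation with Apriori
# pruning. Cells are specialized one dimension at a time; a cell
# that fails MIN_SUP is never expanded into descendants.
from collections import defaultdict

MIN_SUP = 2  # iceberg condition: count >= 2

def iceberg(tuples, dims, cell=()):
    """Recursively specialize `cell` on the remaining dimensions,
    materializing only cells that meet MIN_SUP."""
    results = {}
    for d in range(len(dims)):
        groups = defaultdict(list)
        for t in tuples:
            groups[t[d]].append(t)
        for value, group in groups.items():
            if len(group) < MIN_SUP:
                continue  # Apriori pruning: skip all descendants
            new_cell = cell + ((dims[d], value),)
            results[new_cell] = len(group)
            # Specialize further only on dimensions after d, so
            # each cell is generated exactly once.
            results.update(iceberg(
                [t[d + 1:] for t in group], dims[d + 1:], new_cell))
    return results

data = [("B1", "milk"), ("B1", "bread"), ("B1", "milk"), ("B2", "milk")]
print(iceberg(data, ["branch", "item"]))
# {(('branch', 'B1'),): 3,
#  (('branch', 'B1'), ('item', 'milk')): 2,
#  (('item', 'milk'),): 3}
```

Note that the cell (branch = B2) has count 1, so neither it nor any of its descendants, such as (branch = B2, item = milk), is ever materialized.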