Describe BUC method for computing iceberg cubes from the apex cuboid downward.

 BUC is an algorithm for sparse and iceberg cube computation. Unlike Multi-Way, BUC constructs the cube from the apex cuboid toward the base cuboid as shown in figure 4.5. Partitioning analysis and operation are examination costs in BUC's cube computation. Since recursive analysis in BUC does not cut the input size, both divide and aggregation are expensive. BUC uses the bottom-up approach that allows pruning unnecessary computation by recurring to the A-priori pruning strategy. If a given cell does not satisfy minimum support, then neither can any of its descendants. The Iceberg cube problem is to compute all group-bys that satisfy an iceberg condition.



BUC Algorithm Explanation:

  • Initially, the algorithm takes relation (set of tuples) as input and aggregates the entire input, and writes the resulting total.
  • Divide dimensions into partitions: for each dimension, the input is partitioned on dimensions and count the total no. of tuples for each distinct value of the dimension. Each distinct value of dimension from its own partition.
  • Test the partition for minimum support to perform iceberg pruning:

Apriori property: if a cell does not satisfy the minimum support, then neither can any it's descendants. 

Iceberg pruning: if the no, of tuples in the partition does not satisfy (i.e.,<) the minimum support, then its descendants can be pruned. 

  • If the number of tuples in the partition satisfies(i.e., >=) minimum support then the partition descends one level deeper into the lattice and so on. In return continue with the next partition for dimension.
  • After all the partitions have been processed, the entire process is repeated for each of the remaining dimensions.

Example: Let's look at how BUC constructs the iceberg cube for the dimensions A, B, and C having 3 as minimum support count?

Solution:- Suppose that dimension A has four distinct values, a1, a2, a3, a4; B has four distinct values, b1, b2, b3, b4; and C has two distinct values, c1, C2. To construct the iceberg cube, we must compute every combination of the grouping attributes that satisfy minimum support (i.e., that have 3 tuples). To do so, BUC scans the input, aggregating the tuples to obtain a count for all, corresponding to the cell (*, *, *).

(Optional for understanding point of view)

• Starting with A dimension value, a1, the a1 partition is aggregated, creating one tuple for the A group-by, corresponding to the cell (a1, *, *)

• Suppose (a1, *, *) satisfies the minimum support, in which case a recursive call is made on the partition for a1

• Suppose the cell count for (a1, b1, *) is 2, which does not satisfy the minimum support so, BUC prunes any further exploration of (a1, b1, *).That is, it avoids partitioning this cell on dimension B

• It backtracks to the a1, partition and recurses on (a1, b2, *), and so on


The performance of BUC is sensitive to the order of the dimensions and to skew in the data. Ideally, the most discriminating dimensions should be processed first. Dimensions should be processed in order of decreasing cardinality. The higher the cardinality is, the smaller the partitions are, and thus, the more partitions there will be, thereby providing BUC with greater opportunity for pruning. Similarly, the more uniform a dimension is (i.e., having less skew), the better it is for pruning.



Comments

Popular posts from this blog

Discuss classification or taxonomy of virtualization at different levels.

Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where a charge is the fee that a doctor charges a patient for a visit. a) Draw a schema diagram for the above data warehouse using one of the schemas. [star, snowflake, fact constellation] b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? c) To obtain the same list, write an SQL query assuming the data are stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge)

What is RMI? Discuss stub and skeleton. Explain its role in creating distributed applications.