What is hierarchical clustering? What is agglomerative hierarchical clustering? What is divisive hierarchical clustering?

Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree. It builds this tree by hierarchically decomposing the data points (merging or splitting them) based on a similarity or distance matrix. Depending on whether the decomposition is formed in a bottom-up (merging) or top-down (splitting) fashion, a hierarchical clustering method falls into one of two main categories:

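Agglomerative hierarchical clustering: It works in a bottom-up manner. That is, each object is initially considered as a single-element cluster. At each step of the algorithm, the two clusters that are the most similar are combined into a new, bigger cluster. This procedure is iterated until all points are members of one single big cluster. The result is a tree as shown in the figure below. Agglomerative clustering is good at identifying small clusters.

Divisive hierarchical clustering: It works in a top-down manner. That is, all objects initially belong to one single cluster. At each step of the algorithm, a cluster is split into smaller clusters. This procedure is iterated until each object forms its own single-element cluster or a stopping condition is met.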
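
The bottom-up procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the function names `single_link` and `agglomerative`, the choice of single linkage (distance between the closest pair of points) as the similarity measure, and the four sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Cluster distance (assumed here): distance of the closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerative(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    # Start with every point in its own single-element cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest single-link distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_link(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Made-up sample data: two pairs of nearby points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
history = agglomerative(points)
for left, right, d in history:
    print(left, "+", right, "at distance", round(d, 3))
```

Each entry of `history` records one merge, so reading the list from first to last walks the tree from the leaves up to the single root cluster. Note that a merge is never revisited once it is recorded.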




Advantages of Hierarchical clustering:
  • There is no need to assume a particular number of clusters in advance. Any desired number of clusters can be obtained by cutting the tree at the proper level.
  • The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction).
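
The idea of cutting one tree at different levels can be sketched with SciPy's `linkage` and `fcluster` functions. The data values and the choice of average linkage are assumptions picked only for illustration; the point is that the tree is built once and then cut twice to obtain two and three clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Made-up toy data: six observations with one feature each.
X = np.array([[0.0], [1.0], [5.0], [6.5], [20.0], [23.0]])

# Build the tree once; average linkage is an arbitrary choice here.
Z = linkage(X, method="average")

# Cut the same tree at two different levels.
# criterion="maxclust" asks for at most t flat clusters.
labels_k2 = fcluster(Z, t=2, criterion="maxclust")
labels_k3 = fcluster(Z, t=3, criterion="maxclust")

print("2 clusters:", labels_k2)
print("3 clusters:", labels_k3)
```

The three-cluster result is a refinement of the two-cluster result, because both come from the same nested tree.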
Disadvantages of Hierarchical clustering:
  • Hierarchical clustering methods can encounter difficulties regarding the selection of merge or split points. Such a decision is critical because once a group of objects is merged or split, the process at the next step will operate on the newly generated clusters. It will neither undo what was done previously, nor perform object swapping between clusters. Thus, merge or split decisions, if not well chosen, may lead to low-quality clusters.
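  • The methods do not scale well because each merge or split decision needs to examine and evaluate many objects or clusters. A standard agglomerative algorithm, for instance, stores the pairwise distances between all n objects, which takes O(n^2) memory, and a naive implementation takes O(n^3) time.
  • No global objective function is directly minimized.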
