Explain the various techniques used for clustering.

 Clustering techniques are used to identify groups of similar objects in a data set collected from various sources such as marketing, bio-medical, web, geospatial, etc. The clustering techniques are broadly divided into Hard clustering (data point belongs to only one group) and Soft Clustering (data points can belong to another group also). But there are also other various techniques of Clustering that exist. They are:

1.  Partitioning approach: The partitioning method groups the data into k groups, which together satisfy the following requirements:

- Each group must contain at least one data point, and

- Each data point must belong to exactly one group.

. The partitioning method creates an initial partitioning. It then uses an iterative relocation technique that attempts to improve the partitioning by moving objects from one group to another. Improvement in partitioning is evaluated by some criterion, e.g., minimizing the sum of square errors. It is also known as the centroid-based method because it uses the centroid of a cluster, C to represent each cluster. The centroid of a cluster is its center points such as the mean, median, or mode of the data points assigned to the cluster. A typical example of algorithms based on the partitioning approach are k-means, K-means ++, mini-batch k means, k medoids, etc.



2. Hierarchical approach: Hierarchical method creates a hierarchical decomposition of the given set of data points using some criterion such as proximity measures. Hierarchical decomposition of data points results in a tree-like structure, which is also called a dendrogram.



a. Agglomerative Approach: This approach starts with each object forming a separate group. It keeps on merging the objects or groups that are close to one another. It keeps on doing so until all of the groups are merged into one or until the termination condition holds.

b. Divisive Approach: This approach starts with all of the objects in the same cluster. In
the continuous iteration, a cluster is split up into smaller clusters. It is down until each object in one cluster or the termination condition holds. This method is rigid, i.e., once a merging or splitting is done, it can never be undone. Typical examples of algorithms based on hierarchical approach are Diana, Agnes, BIRCH CAMELEON, etc.



3. Density-based approach: Density-based method is based on connectivity and density functions. That is, this method continues growing the given cluster as long as the density in the neighborhood exceeds some threshold, Le, for each data point within a given cluster, the radius of a given cluster has to contain at least a minimum number of points. Typical examples of algorithms based on density-based approach are DBSACN, OPTICS, DenClue, etc.


Figure 7.8: Density-based approach

4. Grid-based approach: Grid-based method is based on a multiple-level granularity structure. That is, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure. This approach of clustering has a relatively faster processing time and it is dependent only on the number of cells in each dimension in the quantized space. Typical examples of algorithms based on grid-based approach are STING, Wave Cluster, CLIQUE, etc.




5. Model-based: In Model-based clustering method. a model is hypothesized for each of the clusters and tries to find the best fit data points of that model. This method also provides a way to automatically determine the number of clusters based on standard statistics, taking outlier or noise into account. Therefore, it yields robust clustering methods. The model is based on 3 general assumptions: We know the number of clusters before we start
Each observation in the data has a certain probability of belonging to each cluster. The observations within each cluster follow a normal distribution (with the appropriate dimension)Typical examples of algorithms based on model-based approach are Expectation-Maximization (EM), SOM, COBWEB, etc.



6. Constraint-based: Constraint-based Clustering method clusters the given data points by considering user-specified or application-specific constraints. So, it is also known as a user-guided clustering technique. Constraints for constraint-based clustering can be user expectations or the properties of desired clustering results. Typical examples of algorithms based on a hierarchical approach are COD (obstacles), constrained clustering, etc.



7. Link-based clustering: In link-based clustering methods massive links present in the data objects given in the input dataset is used to create the cluster of the input data object. Typical examples of algorithms based on link-based clustering methods are SimRank, LinkClus.



8. Fuzzy clustering: Fuzzy clustering is also known as the soft method. Standard clustering approaches produce partitions (K-means, PAM), in which each data point belongs to only one cluster. This is known as hard clustering. In Fuzzy clustering, items can be a member of more than one cluster. Each item has a set of membership coefficients corresponding to the degree of being in a given cluster. The Fuzzy c-means method is the most popular fuzzy clustering algorithm. In fuzzy c-means clustering, we find out the centroid of the data points and then calculate the distance of each data point from the given centroids until the clusters formed become constant.


Comments

Popular posts from this blog

Suppose that a data warehouse for Big-University consists of the following four dimensions: student, course, semester, and instructor, and two measures count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination. a) Draw a snowflake schema diagram for the data warehouse. b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each BigUniversity student. c) If each dimension has five levels (including all), such as “student < major < status < university < all”, how many cuboids will this cube contain (including the base and apex cuboids)?

Suppose that a data warehouse consists of the four dimensions; date, spectator, location, and game, and the two measures, count and charge, where charge is the fee that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate. a) Draw a star schema diagram for the data b) Starting with the base cuboid [date; spectator; location; game], what specific OLAP operations should perform in order to list the total charge paid by student spectators at GM Place in 2004?

Suppose that a data warehouse consists of the three dimensions time, doctor, and patient, and the two measures count and charge, where a charge is the fee that a doctor charges a patient for a visit. a) Draw a schema diagram for the above data warehouse using one of the schemas. [star, snowflake, fact constellation] b) Starting with the base cuboid [day, doctor, patient], what specific OLAP operations should be performed in order to list the total fee collected by each doctor in 2004? c) To obtain the same list, write an SQL query assuming the data are stored in a relational database with the schema fee (day, month, year, doctor, hospital, patient, count, charge)