Explain the k-means partitioning algorithm, with its pros and cons.

K-MEANS ALGORITHM

  • K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters to be created in the process: for K = 2 there will be two clusters, for K = 3 there will be three clusters, and so on.
  • K-means is a centroid-based clustering algorithm. The centroid of a cluster is its center point; for k-means, the centroid is the mean of the data points assigned to the cluster. Initially, each centroid is set to a randomly chosen data point from the given data set.
  • The k-means algorithm takes a dataset D and k as input and returns k clusters. It first randomly selects k of the data points in D, each of which initially represents a cluster centroid. Each of the remaining data points is then assigned to the cluster to which it is most similar, based on the Euclidean distance between the data point and the cluster center.
  • The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes a new cluster center (the mean) using the data points assigned to that cluster in the previous iteration. All data points are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current iteration are the same as those formed in the previous iteration.
  • The k-means clustering algorithm mainly performs two tasks: 

- Determines the best value for K center points or centroids by an iterative process.

- Assigns each data point to its closest of the k centers; the data points near a particular center form a cluster.
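As a minimal sketch of the assignment step (the function and variable names here are illustrative, not from the source), nearest-center assignment by Euclidean distance can be written in plain Python:

```python
import math

# Assignment step: each point joins the cluster whose centroid
# is nearest by Euclidean distance.
def assign(points, centroids):
    clusters = [[] for _ in centroids]
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        clusters[distances.index(min(distances))].append(p)
    return clusters

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]
print(assign(points, centroids))
# → [[(1.0, 1.0), (1.5, 2.0)], [(8.0, 8.0), (9.0, 9.5)]]
```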

(Diagram, omitted here: the working of the K-means Clustering Algorithm.)

K-means Algorithm

  • Partitioning clustering approach.
  • Each cluster is associated with a centroid (center point or mean point).
  • Each point is assigned to the cluster with the closest centroid.
  • The number of clusters, K, must be specified.



The k-means partitioning algorithm.

Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.

Input:

k: the number of clusters,

D: a data set containing n objects.

Output: A set of k clusters.

Method:

(1) arbitrarily choose k objects from D as the initial cluster centers;

(2) repeat

(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;

(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;

(5) until no change;
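The pseudocode above can be sketched as a self-contained Python function, assuming the data points are numeric tuples (the function and variable names are my own, not from the source; comments reference the numbered steps):

```python
import math
import random

def kmeans(data, k, seed=0):
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = random.Random(seed).sample(data, k)
    while True:
        # (3) (re)assign each object to the cluster with the nearest center
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # (4) update the cluster means (keep the old center if a cluster is empty)
        new_centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        # (5) stop when no center changes
        if new_centers == centers:
            return centers, clusters
        centers = new_centers

centers, clusters = kmeans([(1.0, 1.0), (2.0, 1.0), (9.0, 9.0), (10.0, 10.0)], k=2)
print(sorted(centers))  # → [(1.5, 1.0), (9.5, 9.5)]
```

On this tiny example the algorithm converges to the two obvious groups regardless of which initial points are sampled; in general, different random initializations can yield different clusterings.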



K-means Clustering – Details

  • Initial centroids are often chosen randomly.
  • Clusters produced vary from one run to another.
  • The centroid is (typically) the mean of the points in the cluster.
  • ‘Closeness’ is most often measured by Euclidean distance, but cosine similarity, correlation, etc. can also be used.
  • K-means will converge for common similarity measures mentioned above.
  • Most of the convergence happens in the first few iterations.
  • Often the stopping condition is relaxed to ‘until relatively few points change clusters’
  • Complexity is O( n * K * I * d )

n = number of points, K = number of clusters,

I = number of iterations, d = number of attributes



Issues and Limitations for K-means

  • How to choose initial centers?
  • How to choose K?
  • How to handle Outliers?
  • Struggles when clusters differ in
- Shape
- Density
- Size
  • Assumes clusters are spherical in vector space
  • Sensitive to the scale of the coordinates (rescaling or reweighting attributes changes the clustering)
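To illustrate the coordinate-sensitivity point, a toy example (with invented values) shows how Euclidean distance is dominated by the attribute with the largest numeric range:

```python
import math

# Hypothetical customers as (age in years, annual salary in dollars).
a = (25, 30_000)
b = (30, 31_000)   # similar to a in both attributes
c = (26, 90_000)   # nearly the same age, very different salary

# With raw units, salary's large range dominates the distance,
# so k-means would effectively cluster on salary alone.
print(math.dist(a, b))  # ≈ 1000
print(math.dist(a, c))  # ≈ 60000
```

Standardizing each attribute before clustering (e.g., to zero mean and unit variance) changes these distances, and therefore the clusters k-means finds.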


Pros

  •  Simple
  •  Fast for low-dimensional data
  •  It can find pure sub-clusters if a large number of clusters is specified


Cons

  • K-Means cannot properly handle non-globular clusters, or clusters of different sizes and densities
  • K-Means is sensitive to noisy data and outliers, and will not identify outliers itself
  • K-Means is restricted to data that has the notion of a center (centroid)
  • Applicable only when the mean is defined, which is a problem for categorical data (the k-modes variant substitutes modes for means)
  • The number of clusters, k, must be specified in advance
  • Not suitable for discovering clusters with non-convex shapes
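One common heuristic answer to the "need to specify k in advance" limitation is the elbow method: run k-means for several values of k and look for the point where the within-cluster sum of squared errors (SSE) stops dropping sharply. A rough sketch, again assuming numeric tuples (all names invented):

```python
import math
import random

def sse(data, k, iters=20, seed=0):
    """Within-cluster sum of squared errors after running k-means."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in data:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        centers = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sum(math.dist(p, centers[i]) ** 2
               for i, c in enumerate(clusters) for p in c)

# Three well-separated groups of three points each: the SSE should drop
# sharply up to k = 3 and then flatten, suggesting k = 3 is the "elbow".
data = [(cx + d, cy) for cx, cy in [(0, 0), (10, 0), (5, 9)]
        for d in (-1, 0, 1)]
for k in (1, 2, 3, 4):
    print(k, round(sse(data, k), 1))
```

This is only a heuristic; the elbow can be ambiguous on real data, and measures such as the silhouette coefficient are often used alongside it.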
