Explain the k-means partitioning algorithm with its pros and cons.
K-MEANS ALGORITHM
- K-Means Clustering is an Unsupervised Learning algorithm, which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
- K-means is a centroid-based clustering algorithm. The centroid of a cluster is its center point; for k-means, the centroid is the mean of the data points assigned to the cluster. Initially, each centroid is set to a randomly chosen data point from the given data set.
- The k-means algorithm takes a dataset D and k as input and returns k clusters. To find the k clusters, it first randomly selects k of the data points in D, each of which initially represents a cluster centroid. Each of the remaining data points is then assigned to the cluster to which it is most similar, based on the Euclidean distance between the data point and the cluster center.
- The k-means algorithm then iteratively improves the within-cluster variation. For each cluster, it computes the new cluster center (mean) using the data points assigned to the cluster in the previous iteration. All data points are then reassigned using the updated means as the new cluster centers. The iterations continue until the assignment is stable, that is, the clusters formed in the current iteration are the same as those formed in the previous iteration.
- The k-means clustering algorithm mainly performs two tasks:
- Determines the best positions for the K center points (centroids) by an iterative process.
- Assigns each data point to its closest center; the data points nearest a particular center form a cluster.
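The assignment task can be sketched as a nearest-centroid lookup. The points and centroids below are made-up illustrative values:

```python
import numpy as np

# Illustrative assignment step: each point goes to the closest centroid,
# measured by Euclidean distance. Values are made up for demonstration.
points = np.array([[1.0, 2.0], [8.0, 9.0], [1.5, 1.8]])
centroids = np.array([[1.0, 2.0], [8.0, 8.0]])

# Pairwise distances: shape (n_points, n_centroids)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
print(dists.argmin(axis=1))  # [0 1 0]
```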
The diagram below illustrates the working of the K-means clustering algorithm:
K-means Algorithm
- Partitioning clustering approach.
- Each cluster is associated with a centroid (center point or mean point).
- Each point is assigned to the cluster with the closest centroid.
- A number of clusters, K, must be specified.
The k-means partitioning algorithm.
Algorithm: k-means. The k-means algorithm for partitioning, where each cluster’s center is represented by the mean value of the objects in the cluster.
Input:
k: the number of clusters,
D: a data set containing n objects.
Output: A set of k clusters.
Method:
(1) arbitrarily choose k objects from D as the initial cluster centers;
(2) repeat
(3) (re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster;
(4) update the cluster means, i.e., calculate the mean value of the objects for each cluster;
(5) until no change;
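The pseudocode above can be sketched directly in Python with NumPy. This is a minimal illustration, not a production implementation; it assumes Euclidean distance and that no cluster becomes empty during the iterations:

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Minimal k-means sketch following the pseudocode steps (1)-(5)."""
    rng = np.random.default_rng(seed)
    # (1) arbitrarily choose k objects from D as the initial cluster centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):                        # (2) repeat
        # (3) (re)assign each object to the nearest center (Euclidean distance)
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (4) update the cluster means (assumes every cluster is non-empty)
        new_centers = np.array([D[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):        # (5) until no change
            break
        centers = new_centers
    return labels, centers
```

For two well-separated groups, the algorithm recovers them regardless of which points are drawn as initial centers.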
K-means Clustering – Details
- Initial centroids are often chosen randomly.
- Clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- ‘Closeness’ is typically measured by Euclidean distance, cosine similarity, or correlation.
- K-means will converge for common similarity measures mentioned above.
- Most of the convergence happens in the first few iterations.
- Often the stopping condition is relaxed to ‘until relatively few points change clusters’.
- Complexity is O(n × K × I × d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
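Plugging illustrative (made-up) workload sizes into the O(n × K × I × d) bound gives a sense of the per-run cost:

```python
# Illustrative cost estimate from the O(n * K * I * d) complexity above.
# The workload sizes are made-up example values, not from any benchmark.
n, K, I, d = 100_000, 10, 25, 4
print(n * K * I * d)  # 100_000_000 elementary distance operations
```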
Issues and Limitations for K-means
- How to choose initial centers?
- How to choose K?
- How to handle Outliers?
- How to handle clusters that differ in density and size?
- Assumes clusters are spherical in vector space
- Sensitive to coordinate changes (e.g., rescaling features changes the distances and hence the clusters)
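The sensitivity to coordinate changes is easy to demonstrate: when one feature has a much larger scale, it dominates the Euclidean distance, and rescaling can flip which points are "closest". The values below are made up for illustration:

```python
import numpy as np

# Three made-up points; the second feature is on a much larger scale.
a = np.array([1.0, 1000.0])
b = np.array([1.1, 2000.0])
c = np.array([5.0, 1000.0])

# On raw coordinates, a is closer to c (distance 4) than to b (~1000):
print(np.linalg.norm(a - c) < np.linalg.norm(a - b))  # True

# After dividing the second feature by 1000, the ordering flips:
scale = np.array([1.0, 1000.0])
print(np.linalg.norm((a - c) / scale) < np.linalg.norm((a - b) / scale))  # False
```

This is why features are usually standardized before running k-means.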
Pros
- Simple
- Fast for low dimensional data
- It can find pure sub-clusters if a large number of clusters is specified
Cons
- K-Means cannot handle non-globular data of different sizes and densities
- K-Means will not identify outliers
- K-Means is restricted to data that has the notion of a center (centroid)
- Applicable only when a mean is defined, so it cannot directly handle categorical data (variants such as k-modes address this)
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable to discover clusters with non-convex shapes
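The outlier sensitivity comes from using the mean as the cluster center: a single extreme value drags the mean far from the bulk of the points. A one-dimensional made-up example:

```python
import numpy as np

# Four points near 1 plus one outlier at 100 (made-up values).
pts = np.array([1.0, 1.1, 0.9, 1.05, 100.0])

# The mean is pulled far from the dense group, so a centroid placed at the
# mean represents the cluster poorly:
print(pts.mean())       # 20.81, nowhere near the group around 1
print(np.median(pts))   # 1.05 — the median is robust, which motivates
                        # median-based variants such as k-medians
```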