Demystifying sklearn.cluster.kmeans

Sklearn? cluster? K-means?

Scikit-learn (sklearn) is an open-source Python library that is mainly used in data analysis and machine learning. In machine learning, there are (arguably) two main categories:

Supervised machine learning, which has both predictor variables (also called features or independent variables) and a target (also called the label or dependent variable), and is used to make predictions.
Unsupervised machine learning, which has only independent variables, and is used for pattern recognition.

The example below highlights the difference (figure: supervised classification versus unsupervised clustering applied to pictures of animals):

In the image, the supervised learning method uses a classification algorithm to predict whether an animal is a duck or not. The data has labels/targets (“Duck” vs “Not Duck”), which the algorithm learns from to make predictions. The unsupervised learning algorithm, on the other hand, has no labels and uses clustering to group the animals into categories, or clusters, based on their similarities. The three birds are in one cluster, the rabbit is in another, and the hedgehog is in yet another cluster.

K-means clustering is a clustering algorithm that divides data points into k groups, or clusters, based on how close to each other they are. Each cluster has a centroid, which is the point at the center of the cluster. Because the centroid is the mean of the cluster's points, it may or may not coincide with an actual data point. The aim of k-means clustering is to minimize the total squared distance between the points in each cluster and that cluster's centroid.
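As a rough illustration of the centroid and the quantity being minimized, the sketch below computes both by hand using only the standard library. The points, the two-cluster split, and the helper names `centroid` and `within_cluster_ss` are made up for this example and are not part of scikit-learn.

```python
def centroid(points):
    """Mean of the points, taken coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)


# Hypothetical data: four 2-D points already split into two clusters.
cluster_a = [(1.0, 1.0), (1.0, 3.0)]
cluster_b = [(8.0, 8.0), (10.0, 8.0)]

# The centroid of cluster_a is not one of its data points.
print(centroid(cluster_a))  # (1.0, 2.0)

# The total that k-means tries to make as small as possible.
total = within_cluster_ss(cluster_a) + within_cluster_ss(cluster_b)
print(total)  # 4.0
```

Here the centroid of `cluster_a` falls between its two points, which is why a centroid is sometimes described as an "imaginary" data point.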

sklearn.cluster.KMeans (often written informally as sklearn.cluster.kmeans) is the class in the cluster module of the scikit-learn library that implements the K-means algorithm.
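A minimal usage sketch follows, assuming scikit-learn and NumPy are installed. The toy data and the parameter values (`n_clusters=2`, `n_init=10`, `random_state=0`) are illustrative choices, not recommendations from the original article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points.
X = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.8, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.3],
])

# n_clusters is k; n_init reruns the algorithm from several random starts
# and keeps the best result; random_state makes the run reproducible.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels)                  # cluster index assigned to each row of X
print(model.cluster_centers_)  # one centroid per cluster
print(model.inertia_)          # sum of squared distances to the nearest centroid
```

The cluster indices themselves are arbitrary (which group is called 0 and which is called 1 can differ between runs), so only the grouping is meaningful.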

How does K-means clustering work?

Choose a value for k, the number of clusters you wish to have. For example, k=3 will set up 3 clusters.
Randomly select k data points to act as the initial centroids for those clusters.
Measure the distance between each point and every centroid, and assign each point to the cluster whose centroid is closest.
Calculate the mean of the data points in each cluster and set it as that cluster's new centroid (cluster center).
Repeat the assignment and update steps until the centroids and cluster assignments stop changing or until you reach the maximum number of iterations.
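The steps above can be sketched from scratch in plain Python. This is a teaching sketch under simple assumptions (squared Euclidean distance, an empty cluster keeps its previous centroid), not how scikit-learn implements KMeans internally. The function name `kmeans` and its parameters `max_iter` and `seed` are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Randomly select k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the cluster whose centroid is closest.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Set each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, a different `seed` can number the clusters differently, and on less cleanly separated data it can also lead to a different final grouping.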

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)
