
Introduction to Clustering

Explore clustering algorithms and their limitations with sample datasets and similarity functions.

We'll cover the following...

- What are similarity and dissimilarity functions?
  - Euclidean distance
  - Cosine similarity
- Practical limitations to clustering
- Categories of clustering

In life, we group many things to understand them better or simply to make them easier to handle. Let’s say you and your friend are trying to classify video games. Your friend might classify them by genre and end up with all the video games in neat little clusters. If you classify the same games by developer instead, you might end up with a different set of clusters from your friend’s.

> Note: Both collections come from the same pool of data (video games), and both you and your friend have learned something interesting from it.

In machine learning, we do something similar: we ask the machine to group the data so that we can extract meaningful information from it. This grouping of unlabeled data is called clustering, and it is based on similarity among data points. After clustering, data points should be similar within a cluster and dissimilar across clusters.
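To make "similar within clusters and dissimilar across clusters" concrete, here is a minimal sketch. It assumes straight-line distance as the measure of dissimilarity and uses a hypothetical, hand-labeled toy dataset; the function names are illustrative and not part of any library.

```python
import math
from itertools import combinations

# Hypothetical toy data: two hand-labeled groups of 2D points.
clusters = {
    "a": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "b": [(8.0, 8.0), (8.0, 9.0), (9.0, 8.0)],
}


def mean_within_distance(clusters):
    """Average distance between pairs of points that share a cluster."""
    dists = [
        math.dist(p, q)
        for points in clusters.values()
        for p, q in combinations(points, 2)
    ]
    return sum(dists) / len(dists)


def mean_between_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    dists = [
        math.dist(p, q)
        for pts_a, pts_b in combinations(clusters.values(), 2)
        for p in pts_a
        for q in pts_b
    ]
    return sum(dists) / len(dists)


within = mean_within_distance(clusters)
between = mean_between_distance(clusters)
print(f"mean within-cluster distance:  {within:.2f}")
print(f"mean between-cluster distance: {between:.2f}")
```

For a sensible grouping, the within-cluster average should come out clearly smaller than the between-cluster average, as it does for this toy data.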

We can see a lot of data points below:


In the illustration above, we have a lot of data. But this data is well separated, with clear boundaries between the groups, and it is two-dimensional. Real-world data is rarely this clean, and with more than three features, plotting it directly becomes nearly impossible.

> Note: Clustering algorithms depend heavily on the similarity or dissimilarity function they use.

What are similarity and dissimilarity functions?

A similarity function is a numerical measure of how alike two data points are: given any pair of data points, it returns a value that is higher the more alike they are. A dissimilarity function does the opposite: it is a numerical measure of how different two data points are.
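As a minimal sketch of the two ideas, the snippet below implements one dissimilarity function (Euclidean distance) and one similarity function (cosine similarity), the two measures this lesson names. It uses only the Python standard library; the function names and sample points are illustrative assumptions rather than part of the course code.

```python
import math


def euclidean_distance(x, y):
    """Dissimilarity: 0 for identical points, larger as points move apart."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Similarity: 1 when vectors point the same way, 0 when perpendicular."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of features")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


p = (1.0, 2.0)
q = (2.0, 4.0)   # same direction as p, twice the length
r = (-2.0, 1.0)  # perpendicular to p

print("distance(p, q):  ", euclidean_distance(p, q))
print("similarity(p, q):", cosine_similarity(p, q))
print("similarity(p, r):", cosine_similarity(p, r))
```

Note that the two measures can disagree: `p` and `q` are a nonzero distance apart, yet their cosine similarity is (up to floating-point rounding) the maximum value of 1 because they point in the same direction. Which function suits a clustering task depends on whether magnitude or direction matters for the data.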

Euclidean distance

The Euclidean distance $E$ between two points $\mathbf{x} = (x_1, x_2, \dots, x_d)$ and $\mathbf{y} = (y_1, y_2, \dots, y_d)$ ...
