Clustering with H2O

Learn the fundamentals of clustering and its implementation using the H2O package.

Introduction to clustering models

Unsupervised clustering models group similar data points based on their inherent patterns or similarities. Their goal is to find natural groupings within the data without any predefined labels or categories.

To achieve well-defined clusters, the algorithm measures the similarity or distance between data points and groups together those that are close. The goal of unsupervised clustering is to maximize intracluster similarity while also maximizing intercluster dissimilarity. In other words, we want the data points within a cluster to be similar to each other, and at the same time, we want each cluster to be distinct from the others.

By optimizing these two factors, unsupervised clustering algorithms can identify meaningful and well-separated clusters in the data. They allow us to uncover underlying patterns, segment the data into distinct groups, and gain insights into the structure of the dataset.

Approaches to computing similarity

The underlying mathematics in clustering involves measuring the similarity or dissimilarity between data points. The most common approach is to use distance measures, such as Euclidean distance or cosine similarity, to quantify the similarity between data points. Euclidean distance measures how close two points or vectors are in a multidimensional space based on the straight-line distance between them; the smaller the Euclidean distance, the more similar the points, indicating they are closer together in space. Cosine similarity measures how similar two vectors or documents are based on the cosine of the angle between them; when the vectors point in the same direction, the cosine similarity is closer to 1, indicating high similarity, and vice versa. These measures calculate the geometric or angular difference between the feature vectors representing the data points. ...
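As a minimal sketch of the two measures just described, the functions below implement Euclidean distance and cosine similarity for plain Python tuples (the function names and sample vectors are illustrative, not H2O API calls). Note that the two measures can disagree: vectors pointing in the same direction have cosine similarity 1 even when they are geometrically far apart.

```python
from math import sqrt

def euclidean_distance(u, v):
    # Straight-line distance: smaller means more similar
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Cosine of the angle between the vectors: closer to 1 means more similar
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# v is u scaled by 2: same direction, different magnitude
u, v = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)
print(euclidean_distance(u, v))  # nonzero: the points are apart in space
print(cosine_similarity(u, v))   # 1.0: the vectors point the same way
```

Which measure is appropriate depends on the data: Euclidean distance is sensitive to magnitude, while cosine similarity only compares direction, which is why the latter is popular for high-dimensional data such as text vectors.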