Clustering Model and Prediction

Learn how to use H2O's KMeans algorithm for building insightful clustering models.

In this lesson, we’ll cover the implementation of the H2OKMeansEstimator algorithm, including how it works, the key parameters involved, and how to interpret the results. We’ll learn how to preprocess the RFM dataset, apply the clustering algorithm using H2OKMeansEstimator, and visualize the clustering results. By the end of this lesson, we’ll have the knowledge and practical skills to perform clustering on different datasets and uncover valuable insights for business or research. Let’s dive in and learn about clustering models in detail.

Training H2OKMeansEstimator

Let’s work with the RFM dataset to build a clustering model leveraging the H2OKMeansEstimator algorithm from the H2O library. This dataset provides information on retail customers and their purchasing history, and is commonly used in marketing and retail analytics. The dataset consists of Recency, Frequency, and Monetary columns. Let’s have a look:

import h2o
h2o.init()  # start or connect to a local H2O cluster

# Reading the dataset into an H2OFrame and checking some samples
data = h2o.import_file(filepath + 'RFM.csv')
data.head()
data.describe()

To train our model, we must carefully select the appropriate model parameters, such as the number of clusters, the initialization strategy, and whether to enable the iterative estimation of the number of clusters (up to a maximum k) and standardization. Additionally, we need to specify the predictor variables by setting the x parameter. Once we’ve trained our model, we can use it to assign cluster labels to customers based on their recency, frequency, and monetary value.

from h2o.estimators import H2OKMeansEstimator as H2OKMeans

# Log transformation to reduce skewness in the spend-related columns
data['Frequency'] = data['Frequency'].log10()
data['Monetary'] = data['Monetary'].log10()

# Split into training (70%), validation (15%), and test (15%) sets
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]

X = ['Recency', 'Frequency', 'Monetary']

# Set up the H2O KMeans parameters; with estimate_k=True,
# k=100 acts as an upper bound on the number of clusters
kmeans = H2OKMeans(k=100, estimate_k=True, seed=42,
                   standardize=True, max_iterations=100)

# Train the model and evaluate it on all three subsets
kmeans.train(x=X, training_frame=train, validation_frame=valid)
print(kmeans.model_performance(test_data=train))
print(kmeans.model_performance(test_data=valid))
print(kmeans.model_performance(test_data=test))

We have successfully trained our clustering model using the H2OKMeansEstimator algorithm.

  • To reduce skewness in our dataset, we applied a log transformation to the Frequency and Monetary columns.

  • Afterward, we divided the dataset into three subsets: training, validation, and test, with a ratio of 0.7:0.15:0.15.

  • Next, we trained the H2OKMeansEstimator algorithm on the three input features, Recency, Frequency, and Monetary, and examined its performance on all three subsets of data.

In many cases, we may not have prior knowledge of the optimal number of clusters. When that happens, we can enable the estimate_k parameter while training the H2OKMeansEstimator algorithm. This allows the algorithm to automatically estimate the best number of clusters, k, up to the specified maximum.

By employing this approach, our dataset has been grouped into three clusters, and we can retrieve the relevant model metrics with the model_performance method from the h2o library. The model_performance output includes the following information:

  • Total within cluster sum of squared error ($\mathrm{SS_{Within}}$): This metric indicates how close the data points within each cluster are to their respective cluster centroid.

  • Between cluster sum of ...