Clustering Model and Prediction

Learn how to use H2O's KMeans algorithm for building insightful clustering models.

In this lesson, we’ll cover the implementation of the H2OKMeansEstimator algorithm, including how it works, the key parameters involved, and how to interpret the results. We’ll learn how to preprocess the RFM dataset, apply the clustering algorithm using H2OKMeansEstimator, and visualize the clustering results. By the end of this lesson, we’ll have the knowledge and practical skills to perform clustering on different datasets and uncover valuable insights for business or research. Let’s dive in and learn about clustering models in detail.

Training H2OKMeansEstimator

Let’s work with the RFM dataset to build a clustering model leveraging the H2OKMeansEstimator algorithm from the H2O library. This dataset provides information on retail customers and their purchasing history, and is commonly used in marketing and retail analytics. The dataset consists of Recency, Frequency, and Monetary columns. Let’s have a look:

import h2o
h2o.init()  # start or connect to a local H2O cluster

# Reading the dataset into an H2OFrame and checking some samples
data = h2o.import_file(filepath + 'RFM.csv')
data.head()
data.describe()

To train our model, we must carefully select the appropriate model parameters, such as the number of clusters, the initialization strategy, and whether to enable the iterative estimation of the number of clusters (up to a maximum k) and standardization. Additionally, we need to specify the predictor variables by setting the x parameter. Once we’ve trained our model, we can use it to assign cluster labels to customers based on their recency, frequency, and monetary value.

from h2o.estimators import H2OKMeansEstimator as H2OKMeans

# Log transformation to reduce skewness in the spend-related columns
data['Frequency'] = data['Frequency'].log10()
data['Monetary'] = data['Monetary'].log10()

# Split into training (70%), validation (15%), and test (15%) sets
splits = data.split_frame(ratios=[0.7, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]

X = ['Recency', 'Frequency', 'Monetary']

# Set up the H2O KMeans parameters; with estimate_k=True,
# k=100 acts as an upper bound on the number of clusters
kmeans = H2OKMeans(k=100, estimate_k=True, seed=42,
                   standardize=True, max_iterations=100)

# Train the model and evaluate it on all three subsets
kmeans.train(x=X, training_frame=train, validation_frame=valid)
print(kmeans.model_performance(test_data=train))
print(kmeans.model_performance(test_data=valid))
print(kmeans.model_performance(test_data=test))

We have successfully trained our clustering model using the H2OKMeansEstimator algorithm.

  • To reduce skewness in our dataset, we applied a log transformation to the Frequency and Monetary columns.

  • Afterward, we divided the dataset into three subsets: training, validation, and test, with a ratio of 0.7:0.15:0.15.

  • Next, we trained the H2OKMeansEstimator algorithm on the three input features, Recency, Frequency, and Monetary, and examined its performance on all three subsets of data.

In many cases, we may not have prior knowledge of the optimal number of clusters. When that happens, we can enable the estimate_k parameter while training the H2OKMeansEstimator algorithm. This allows the algorithm to automatically estimate the best number of clusters, k, up to the specified maximum.

By employing this approach, our dataset has been grouped into three clusters, and we can retrieve the relevant model metrics with the model_performance method from the h2o library. The model_performance output includes the following information:

  • Total within cluster sum of squared error ($\mathrm{SS_{Within}}$): This metric indicates how close the data points within each cluster are to their respective cluster centroid.

  • Between cluster sum of ...