Clustering Model and Prediction
Learn how to use H2O's KMeans algorithm for building insightful clustering models.
In this lesson, we’ll cover the implementation of the H2OKMeansEstimator algorithm, including how it works, the key parameters involved, and how to interpret its results. We’ll learn how to preprocess the RFM dataset, apply the clustering algorithm using H2OKMeansEstimator, and visualize the clustering results. By the end of this lesson, we’ll have the knowledge and practical skills to perform clustering on different datasets and uncover valuable insights for business or research. Let’s dive in and learn about clustering models in detail.
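Before turning to H2O’s implementation, it helps to see the core procedure that k-means follows: pick k initial centroids, assign every point to its nearest centroid, move each centroid to the mean of its assigned points, and repeat. The sketch below is a minimal pure-Python version of this loop (Lloyd’s algorithm); the sample points, seed, and fixed iteration count are illustrative and not part of H2O’s API:

```python
import math
import random

def kmeans(points, k, iters=20, seed=42):
    """A minimal sketch of Lloyd's algorithm, the loop behind k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # 1. pick k distinct points as initial centroids
    for _ in range(iters):
        # 2. assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # 3. move each centroid to the mean of its assigned points
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters

# Two well-separated blobs; the loop should recover one centroid per blob.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 5.1), (4.9, 5.3)]
centroids, clusters = kmeans(points, k=2)
```

H2O’s estimator performs the same basic iteration, but distributed across the cluster and with smarter initialization and stopping options exposed as parameters.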
Training H2OKMeansEstimator
Let’s work with the RFM dataset to build a clustering model leveraging the H2OKMeansEstimator algorithm from the H2O library. This dataset provides information on retail customers and their purchasing history, and it is commonly used in marketing and retail analytics. It consists of Recency, Frequency, and Monetary columns. Let’s have a look:
```python
import h2o

# Read the dataset into an H2OFrame and check some samples
h2o.init()
data = h2o.import_file(filepath + 'RFM.csv')
data.head()
data.describe()
```
To train our model, we must carefully select the appropriate model parameters, such as the number of clusters, the initialization strategy, and whether to enable the iterative estimation of the number of clusters (up to k) and standardization. Additionally, we need to specify the predictor variables by setting the x parameter. Once we’ve trained our model, we can use it to assign cluster labels to customers based on their recency, frequency, and monetary value.
```python
from h2o.estimators import H2OKMeansEstimator as H2OKMeans

# log transformation for skewness removal
data['Frequency'] = data['Frequency'].log10()
data['Monetary'] = data['Monetary'].log10()

splits = data.split_frame(ratios=[0.7, 0.15], seed=1)
train = splits[0]
valid = splits[1]
test = splits[2]

X = ['Recency', 'Frequency', 'Monetary']
# Set up the H2O KMeans parameters
kmeans = H2OKMeans(k=100, estimate_k=True, seed=42,
                   standardize=True, max_iterations=100)

# Train the model
kmeans.train(x=X, training_frame=train, validation_frame=valid)

print(kmeans.model_performance(test_data=train))
print(kmeans.model_performance(test_data=valid))
print(kmeans.model_performance(test_data=test))
```
We have successfully trained our clustering model using the H2OKMeansEstimator algorithm.
To eliminate skewness from our dataset, we applied a log transformation to the Frequency and Monetary columns (lines 4–5). Afterward, we divided the dataset into three subsets, training, validation, and test, in a 70:15:15 ratio (lines 7–10). Next, we trained the H2OKMeansEstimator algorithm using the three input features, Recency, Frequency, and Monetary, and examined its performance on all three subsets of data (lines 12–22).
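With the model trained, kmeans.predict(test) returns a cluster label for each customer. Conceptually, prediction is just a nearest-centroid lookup, which the sketch below illustrates in plain Python. The centroid values are made up for illustration, they are not from the trained model, and the sketch ignores the internal standardization that H2O applies when standardize=True:

```python
import math

def assign_cluster(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical centroids in (Recency, log10 Frequency, log10 Monetary) space:
# recent high-value customers, mid-range customers, and lapsed customers.
centroids = [(10.0, 1.5, 3.0), (60.0, 0.5, 1.8), (150.0, 0.2, 1.2)]
label = assign_cluster((12.0, 1.4, 2.9), centroids)
print(label)  # nearest to the first centroid, so 0
```

This is why the log transformation and standardization matter: without them, the large Recency values would dominate the distance calculation and drown out Frequency and Monetary.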
In many cases, we may not have prior knowledge of the optimal number of clusters. When that happens, we can leverage the estimate_k parameter while training the H2OKMeansEstimator algorithm. This allows the algorithm to automatically estimate the best number of clusters, k, from the available options.
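H2O’s internal heuristic for estimate_k is more involved, but the underlying idea can be illustrated with a simple hand-rolled rule: keep growing k while the total within-cluster sum of squares (WCSS) drops substantially, and stop once an extra cluster yields little improvement. The function and threshold below are an illustrative sketch, not H2O’s actual algorithm:

```python
def choose_k(wcss_by_k, threshold=0.2):
    """Pick the smallest k after which adding a cluster improves WCSS by < threshold.

    wcss_by_k[i] holds the total within-cluster sum of squares for k = i + 1.
    """
    for i in range(1, len(wcss_by_k)):
        improvement = (wcss_by_k[i - 1] - wcss_by_k[i]) / wcss_by_k[i - 1]
        if improvement < threshold:
            return i  # k = i, the last candidate that was still worthwhile
    return len(wcss_by_k)

# Made-up WCSS values for k = 1..6: big gains up to k = 3, then diminishing returns.
print(choose_k([1000, 400, 150, 140, 135, 133]))  # 3
```

This is the same intuition behind the familiar elbow plot: the chosen k sits where the WCSS curve stops falling steeply.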
By employing this approach, our dataset has been grouped into three clusters, and we can get the relevant model metrics with the model’s model_performance method. From the model_performance output, we can get the following information:
- Total within-cluster sum of squared errors: This metric indicates how close the data points within a cluster are to their respective cluster centroid.
- Between-cluster sum of ...