Challenge: Analyzing Clustered Data

Using the coding playground at the end of this lesson, perform the tasks highlighted in the “Problem statement” section.

Problem statement

In the playground below, we have a full end-to-end process of training and using a clustering model on housing market data. We tell the model to create three distinct clusters and then feed the data to it.

Once the model is trained, we pass it an example of a house, predict which cluster the house belongs to, and print the result to the console. We also print the distance measures. We then print the distribution of the training data across all clusters.

However, neither the cluster ID nor the distances provide any meaningful information to us. Therefore, you need to make some changes.

First, you don’t need to print the distances to the console. Second, you need to sort the property clusters into the following categories:

  • High-end

  • Cheap

  • Mid-range

To do so, you need to create a dictionary in which the key is a uint (the cluster ID) and the value is a string (the category name). You will then need to complete the following tasks:

  • Add the high-end cluster ID to the dictionary.

  • Add the cheap cluster ID to the dictionary.

  • Use human-readable category labels in the distribution.
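
A minimal sketch of such a dictionary is shown here. The cluster ID is a hypothetical value (in the playground it would come from the predictor), and the variable names are illustrative rather than taken from Program.cs.

```csharp
using System;
using System.Collections.Generic;

// Maps a cluster ID (uint) to a human-readable category name (string).
var clusterCategories = new Dictionary<uint, string>();

// Hypothetical value: in the playground, this ID is returned by the predictor.
uint predictedClusterId = 2;

// Register the ID under its category and report it.
clusterCategories[predictedClusterId] = "High-end";
Console.WriteLine($"Cluster {predictedClusterId} represents {clusterCategories[predictedClusterId]} properties.");
```

Keying the dictionary by the ID that the predictor returns, rather than by a hard-coded number, keeps the labels correct even if the model assigns different IDs after retraining.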

Add the high-end cluster ID to the dictionary

On line 35 of Program.cs, we pass an example of a high-end property to the predictor. The property has five bedrooms and three bathrooms and costs more than two million, so the cluster ID that the predictor returns represents high-end properties. Add this cluster ID to the dictionary as a key, with High-end as its value. Also print a message to the console stating that this cluster ID represents high-end properties.

Add the cheap cluster ID to the dictionary

Next, select a property from the lisbon_house_prices.csv file in the Data folder that has one bedroom, costs less than three thousand per square meter, and has a total area of less than 30. Pass this property to the predictor and add the returned cluster ID to the dictionary as the key for the Cheap category. As before, print the cheap cluster ID to the console.
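
The three selection criteria can be expressed as a simple filter, as in the rough sketch below. The Listing record, its property names, and the sample rows are all hypothetical; the actual columns and values are those in lisbon_house_prices.csv, and the input class in the playground may look different.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, hand-written rows used only to illustrate the filter.
var listings = new List<Listing>
{
    new Listing(Bedrooms: 1, Area: 28, PricePerSqm: 2500),
    new Listing(Bedrooms: 1, Area: 45, PricePerSqm: 2900),
    new Listing(Bedrooms: 3, Area: 25, PricePerSqm: 2000),
};

// Keep only the rows that satisfy all three criteria from the task.
var cheapCandidates = listings
    .Where(l => l.Bedrooms == 1 && l.PricePerSqm < 3000 && l.Area < 30)
    .ToList();

foreach (var candidate in cheapCandidates)
{
    Console.WriteLine(candidate);
}

// Hypothetical record; not the input class used in the playground.
record Listing(int Bedrooms, double Area, double PricePerSqm);
```

Any row that passes the filter can serve as the example to pass to the predictor.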

Use human-readable category labels in the distribution

By this point, you'll know the cluster IDs for high-end and cheap properties. Now, use the human-readable category labels instead of raw cluster IDs in the data distribution output. If you come across a cluster ID you haven’t seen before, add it to the dictionary as Mid-range.
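
One possible way to apply the labels is sketched below, assuming the distribution is available as a mapping from cluster ID to row count. The IDs and counts are made-up placeholders, and the playground may expose the distribution in a different shape.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical IDs for the two clusters identified in the earlier tasks.
var clusterCategories = new Dictionary<uint, string>
{
    [2u] = "High-end",
    [1u] = "Cheap",
};

// Hypothetical distribution: cluster ID -> number of training rows.
var distribution = new Dictionary<uint, int>
{
    [1u] = 120,
    [2u] = 40,
    [3u] = 86,
};

foreach (var entry in distribution)
{
    // Any cluster ID not seen before is registered as Mid-range.
    if (!clusterCategories.ContainsKey(entry.Key))
    {
        clusterCategories[entry.Key] = "Mid-range";
    }

    Console.WriteLine($"{clusterCategories[entry.Key]}: {entry.Value}");
}
```

Because the model is configured with three clusters, the fallback assigns Mid-range to exactly one remaining ID once the other two are known.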
