Case Study

Learn how to compute different distance computation algorithms using sympy, create unit tests, and analyze coverage for thorough validation.

We’ll return to some material from an earlier chapter and apply some careful testing to be sure we’ve got a good, workable implementation. We’ve looked at the distance computations that are part of the kk-nearest neighbors classifier and several computations that produced slightly different results:

  • Euclidean distance: This is the direct line from one sample to another.

  • Manhattan distance: This follows streets-and-avenues around a grid (like the city of Manhattan), adding up the steps required along a series of straight-line paths.

  • Chebyshev distance: This is the largest of the streets-and-avenues distances.

  • Sorensen distance: This is a variation of the Manhattan distance that weights nearby steps more heavily than distant steps. It tends to magnify small distances, making more subtle discriminations.

These algorithms all produce distinct results from the same inputs; they all involve complex-looking math, and they all need to be tested in isolation to ensure we have implemented them correctly. We’ll start with unit tests of the distances.

Unit testing the distance classes

We need to create some test cases for each distance computation algorithm. When we look at the various equations, we can see that there are four pairs of relevant values from two samples: the sepal length and width, and the petal length and width. To be extremely thorough, we could create at least 1616 distinct cases for each algorithm:

  • Case 0: All four values are the same; the distance should be zero.

  • Cases 1–4: One of the four values is different between the two samples. For example, a test sample might have measurements of ("sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2), where as a training sample might have measurements of ("sepal_length": 5.2, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2); only one of these values is distinct.

  • Cases 5–10: A pair of values is different.

  • Cases 11–14: Three values are different between the two samples.

  • Case 15: All four values are different.

In addition, the concepts of equivalence partitioning and boundary value analysis suggest that we also need to locate values where there is a profound state change. For example, invalid values will raise exceptions, something that should also be tested. This can create a number of sub-cases within each of the cases enumerated above.

We won’t create all 16 cases for each of the four algorithms in this part of the case study. Instead, we’ll take a close look at whether or not all 16 cases are really required. To get started, we’ll limit ourselves to one case for each distance algorithm. This will be an example of case 15, where all four values of the two samples are different.

With mathematical results, we need to compute the expected answers outside the software we’re building. One trick that can be helpful when working with more advanced math is to use the sympy package as a way to check the math more carefully. For example, the Euclidean distance between a known sample, kk, and an unknown sample, uu, has the following formal definition:

Get hands-on with 1400+ tech skills courses.