The Essential k-NN Algorithm

Learn about the k-NN algorithm and its implementation using different modules.

Overview

We can summarize the k-NN algorithm as having the following steps:

  1. Create a list of all (distance, training sample) pairs.
  2. Sort these pairs into ascending order by distance.
  3. Pick the first k; these are the k nearest neighbors.
  4. Choose the mode (the highest-frequency) label among the k nearest neighbors.

The implementation would look like this:

import collections
from collections import Counter
from typing import NamedTuple

# DistanceFunc, TrainingList, AnySample, and TrainingKnownSample
# are type definitions introduced earlier in this lesson.

class Measured(NamedTuple):
    """A distance paired with the training sample it was measured from."""
    distance: float
    sample: TrainingKnownSample

def k_nn_1(
    k: int, dist: DistanceFunc, training_data: TrainingList,
    unknown: AnySample
) -> str:
    # Compute every distance, then sort all of them in ascending order.
    distances = sorted(
        map(lambda t: Measured(dist(t, unknown), t), training_data)
    )
    # The first k entries are the k nearest neighbors.
    k_nearest = distances[:k]
    # The most common species among the neighbors is the answer.
    k_frequencies: Counter[str] = collections.Counter(
        s.sample.sample.species for s in k_nearest
    )
    mode, fq = k_frequencies.most_common(1)[0]
    return mode
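To see the function at work, here is a minimal, self-contained sketch. The KnownSample, TrainingKnownSample, and UnknownSample types and the manhattan distance below are illustrative stand-ins for the definitions that appear elsewhere in the course:

```python
import collections
from typing import Callable, List, NamedTuple

class KnownSample(NamedTuple):
    species: str
    sepal_length: float
    sepal_width: float

class TrainingKnownSample(NamedTuple):
    sample: KnownSample

class UnknownSample(NamedTuple):
    sepal_length: float
    sepal_width: float

class Measured(NamedTuple):
    distance: float
    sample: TrainingKnownSample

def manhattan(t: TrainingKnownSample, u: UnknownSample) -> float:
    # Stand-in distance: sum of absolute coordinate differences.
    return (abs(t.sample.sepal_length - u.sepal_length)
            + abs(t.sample.sepal_width - u.sepal_width))

def k_nn_1(
    k: int,
    dist: Callable[[TrainingKnownSample, UnknownSample], float],
    training_data: List[TrainingKnownSample],
    unknown: UnknownSample,
) -> str:
    distances = sorted(
        Measured(dist(t, unknown), t) for t in training_data
    )
    k_nearest = distances[:k]
    k_frequencies = collections.Counter(
        s.sample.sample.species for s in k_nearest
    )
    mode, fq = k_frequencies.most_common(1)[0]
    return mode

training = [
    TrainingKnownSample(KnownSample("setosa", 5.1, 3.5)),
    TrainingKnownSample(KnownSample("setosa", 4.9, 3.0)),
    TrainingKnownSample(KnownSample("versicolor", 7.0, 3.2)),
]
print(k_nn_1(2, manhattan, training, UnknownSample(5.0, 3.4)))  # setosa
```

The two closest training samples are both "setosa", so the mode of the two nearest neighbors classifies the unknown sample as "setosa".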

While clear, this does accumulate a large number of distance values in the distances list object, when only k are actually needed. The sorted() function consumes the source generator and creates a (potentially large) list of intermediate values.

One of the high-cost parts of this specific k-NN algorithm is sorting the entire set of training data after the distances have been computed. We summarize the complexity with the description as an O(n log n) operation. A way to avoid this cost is to avoid sorting the entire set of distance computations.
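One way to sketch this idea (not the course's implementation) is heapq.nsmallest, which scans the distances once while keeping only the k best candidates in a heap, for roughly O(n log k) work instead of O(n log n). The KnownSample type and manhattan distance here are illustrative stand-ins:

```python
import collections
import heapq
from typing import Callable, Iterable, NamedTuple

class KnownSample(NamedTuple):
    species: str
    sepal_length: float
    sepal_width: float

def manhattan(t: KnownSample, u: tuple) -> float:
    # Stand-in distance: sum of absolute coordinate differences.
    return abs(t.sepal_length - u[0]) + abs(t.sepal_width - u[1])

def k_nn_heap(
    k: int,
    dist: Callable[[KnownSample, tuple], float],
    training_data: Iterable[KnownSample],
    unknown: tuple,
) -> str:
    # nsmallest keeps at most k candidates while scanning, so no
    # full sorted list of all n distances is ever materialized.
    k_nearest = heapq.nsmallest(
        k,
        ((dist(t, unknown), t) for t in training_data),
        key=lambda pair: pair[0],  # compare on distance only
    )
    frequencies = collections.Counter(t.species for _, t in k_nearest)
    mode, _ = frequencies.most_common(1)[0]
    return mode

training = [
    KnownSample("setosa", 5.1, 3.5),
    KnownSample("setosa", 4.9, 3.0),
    KnownSample("versicolor", 7.0, 3.2),
]
print(k_nn_heap(2, manhattan, training, (5.0, 3.4)))  # setosa
```

The key= function restricts comparison to the distance value, which also avoids comparing the sample tuples themselves when two distances tie.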

Steps ...