Multiple Partitions

Learn how to divide a dataset into multiple partitions based on certain criteria.

Overview

Our goal is to separate testing and training data. There’s a tiny bump in the road, however, called deduplication. The statistical measures of overall quality rely on the training and testing sets being independent; this means we need to avoid duplicate samples being split between testing and training sets. Before we can create testing and training partitions, we need to find any duplicates.

We can’t—easily—compare each sample with all of the other samples. For a large set of samples, this may take a very long time. A pool of ten thousand samples would lead to 100 million checks for duplication. This isn’t practical. Instead, we can partition our data into subgroups where the values for all the measured features are likely to be equal. Then, from those subgroups, we can choose testing and training samples. This lets us avoid comparing every sample with all of the other samples to look for duplicates.

Mathematical interpretation of partitioning

If we use Python’s internal hash values, we can create buckets containing samples that may have equal values. In Python, if items are equal, they must have the same integer hash value. The inverse is not true: items may coincidentally have the same hash value, but may not actually be equal.

Formally, we can say this:

𝑎=𝑏h(𝑎)=h(𝑏)𝑎 = 𝑏 ⇒ h(𝑎) = h(𝑏)

That is, if two objects in Python, aa and bb, are equal, they must also have the same hash value h(𝑥)h(𝑥). The reverse is not true because equality is more than a simple hash value check; it’s possible that h(𝑎)=h(𝑏)𝑎𝑏h(𝑎) = h(𝑏) ∧ 𝑎 ≠ 𝑏; the hash values may be the same, but the underlying objects aren’t actually equal. We call this a hash collision of two unequal values.

Continuing this thought, the following is a matter of definition for modulo:

h(𝑎)=h(𝑏)h(𝑎)=h(𝑏)mod(m)h(𝑎) = h(𝑏) ⇒ h(𝑎) = h(𝑏) mod (m)

If two values are equal, they are also equal to any modulo of those values. When we want to know if a==ba == b, we can ask if aa % 2 == b % 2; if both numbers are odd or both numbers are even, then there’s a chance aa and bb could be equal. If one number is even and the other is odd, there’s no way they can be equal.

For complex objects, we can use hash(a)hash(a) % m == hash(b) % m. If the two hash values, modulo mm, are the same, then the hash values could be the same, and the two objects, aa and bb, could also be equal. We know it’s possible for several objects to have the same hash value, and even more objects to have the same hash value modulo mm.

While this doesn’t tell us if two items are equal, this technique limits the domain of objects required for exact equality comparison to very small pools of a few items instead of the entire set of all samples. We can avoid duplicates if we avoid splitting up one of these subgroups.

Splitting samples

Here’s a view of seven samples broken into three subgroups based on their hash codes modulo 3. Most of the subgroups have items that are potentially equal, but actually aren’t equal. One of the groups has an actual duplicate sample:

Get hands-on with 1400+ tech skills courses.