Data Classification

Learn how to split the provided samples into training and testing subsets, and think through how data classification works.

Splitting the data

In effect, splitting the data into two subsets can be defined in terms of filter functions. We’ll avoid Python for a moment and focus on the conceptual math to make sure we have the logic completely correct before diving into code. Conceptually, we have a pair of functions, $e(s_i)$ and $r(s_i)$, that decide whether a sample, $s_i$, is for testing, $e$, or training, $r$. These functions are used to partition the samples into two subsets. (If training and testing didn’t both begin with t, we’d have an easier time finding names. It might help to think of $e(s_i)$ as evaluation and testing, and $r(s_i)$ as running a real classification.)
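Written out as sets, the two filters might read as follows. The names $S$ (all samples), $E$ (the testing subset), and $R$ (the training subset) are assumptions introduced here only for illustration:

```latex
E = \{\, s_i \in S \mid e(s_i) \,\}, \qquad R = \{\, s_i \in S \mid r(s_i) \,\}
```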

It’s simpler if these two functions are exclusive: $e(s_i) = \neg r(s_i)$. (We’ll use ¬ instead of the longer not.) If each function is the logical negation of the other, we only need to define one of the two functions.
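The single-function idea can be sketched in Python once the logic is settled. This is a minimal illustration, not the course's own implementation: the names `partition` and `is_testing`, and the "every fifth sample" rule, are hypothetical and chosen only to show that one predicate is enough to produce both subsets.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


def partition(
    samples: Iterable[S], is_testing: Callable[[S], bool]
) -> tuple[list[S], list[S]]:
    """Split samples into (testing, training) using a single predicate.

    is_testing plays the role of e(s_i). Training is simply "not testing",
    so r(s_i) never has to be written as a separate function.
    """
    testing: list[S] = []
    training: list[S] = []
    for sample in samples:
        if is_testing(sample):
            testing.append(sample)
        else:
            training.append(sample)
    return testing, training


# Hypothetical rule for illustration only: every fifth sample is for testing.
samples = list(range(10))
testing, training = partition(samples, lambda s: s % 5 == 0)
print(testing)   # [0, 5]
print(training)  # [1, 2, 3, 4, 6, 7, 8, 9]
```

Because each sample takes exactly one branch of the `if`, the two lists cannot overlap and together they cover every sample, which is the property the exclusive pair of functions is meant to guarantee.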
