Case Study

Learn how to perform data validation and simplify the partitioning of samples.

In this chapter, we’ll continue developing elements of the case study. We want to explore some additional features of object-oriented design in Python. The first is what is sometimes called syntactic sugar: a convenient notation that offers a simpler way to express something fairly complex. The second is the concept of a context manager, which provides a context for resource management.

In the previous chapter, we built an exception for identifying invalid input data. We raised this exception to report inputs that couldn’t be processed.

Here, we’ll start with a class to gather data by reading the file with properly classified training and test data. In this chapter, we’ll ignore some of the exception-handling details so we can focus on another aspect of the problem: partitioning samples into testing and training subsets.

Input validation

The TrainingData object is loaded from a source file of samples, named bezdekIris.data. Currently, we don’t make a large effort to validate the contents of this file. Rather than confirming the data contains correctly formatted samples with numeric measurements and a proper species name, we simply create Sample instances and hope nothing goes wrong. A small change to the data could lead to unexpected problems in obscure parts of our application. By validating the input data right away, we can catch problems early and provide a focused, actionable report back to the user: something like “Row 42 has an invalid petal_length value of 1b.25”, identifying the line of data, the column, and the invalid value.
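
To make that concrete, here is a minimal sketch of the kind of check we have in mind. The exception name InvalidSampleError stands in for the exception built in the previous chapter, and the helper function is a hypothetical illustration, not part of the case study’s final design:

    class InvalidSampleError(ValueError):
        """A row of source data can't be used as a Sample."""

    def valid_measurement(row_number: int, column: str, text: str) -> float:
        """Convert one measurement, reporting row, column, and value on failure."""
        try:
            return float(text)
        except ValueError:
            raise InvalidSampleError(
                f"Row {row_number} has an invalid {column} value of {text}"
            )

A ValueError from float() is translated into our application-specific exception, carrying enough context for the user to locate and fix the offending row.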

A file with training data is processed in our application via the load() method of TrainingData. Currently, this method requires an iterable sequence of dictionaries; each individual sample is read as a dictionary with the measurements and the classification. The type hint is Iterable[dict[str, str]]. This is exactly the kind of row the csv module’s DictReader produces, which makes CSV files very easy to work with.
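
As a sketch of how that interface gets used, the following shows csv.DictReader rows being handed to load(). The stub body of load() and the explicit field names are assumptions made for illustration; the bezdekIris.data file has no header row, so the names must be supplied somewhere:

    import csv
    from pathlib import Path
    from typing import Iterable

    class TrainingData:
        def load(self, raw_data_iter: Iterable[dict[str, str]]) -> None:
            # Sketch: each dict row becomes a Sample in the training or testing subset.
            for row in raw_data_iter:
                ...  # build a Sample from row["sepal_length"], row["species"], etc.

    training_data = TrainingData()
    source_path = Path("bezdekIris.data")
    with source_path.open() as source_file:
        reader = csv.DictReader(
            source_file,
            fieldnames=[
                "sepal_length", "sepal_width",
                "petal_length", "petal_width", "species",
            ],
        )
        training_data.load(reader)

Because load() only sees an iterable of dictionaries, any source able to produce dict[str, str] rows can feed it.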

Thinking about the possibility of alternative formats suggests the TrainingData class should not depend on the dict[str, str] row definition suggested by CSV file processing. While expecting a dictionary of values for each row is simple, it pushes some details into the TrainingData class that may not belong there. Details of the source document’s representation have nothing to do with managing a collection of training and test samples; this seems like a place where object-oriented design will help us disentangle the two ideas.

Conversion of input data into Python objects

In order to support multiple sources of data, we will need some common rules for validating the input values. We’ll need a class like this:
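
One plausible sketch follows; the class name SampleReader, the field names, and the minimal Sample placeholder are assumptions made so the example stands on its own, and the actual definition in the case study may differ:

    import csv
    from dataclasses import dataclass
    from pathlib import Path
    from typing import Iterator

    @dataclass
    class Sample:
        """Stand-in for the case study's Sample class."""
        sepal_length: float
        sepal_width: float
        petal_length: float
        petal_width: float
        species: str

    class InvalidSampleError(ValueError):
        """A row of source data can't be used as a Sample."""

    class SampleReader:
        """Reads rows from a source file and applies common validation rules."""

        target_class = Sample
        header = [
            "sepal_length", "sepal_width",
            "petal_length", "petal_width", "species",
        ]

        def __init__(self, source: Path) -> None:
            self.source = source

        def sample_iter(self) -> Iterator[Sample]:
            target_class = self.target_class
            with self.source.open() as source_file:
                reader = csv.DictReader(source_file, self.header)
                for row in reader:
                    try:
                        sample = target_class(
                            sepal_length=float(row["sepal_length"]),
                            sepal_width=float(row["sepal_width"]),
                            petal_length=float(row["petal_length"]),
                            petal_width=float(row["petal_width"]),
                            species=row["species"],
                        )
                    except ValueError as ex:
                        raise InvalidSampleError(f"Invalid row: {row!r}") from ex
                    yield sample

The intent is that details like the field names and the file layout live in this reader class, so TrainingData can stay focused on managing the training and testing subsets; an alternative source format could be handled by a different reader class offering the same sample_iter() interface.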
