CSV Parsing: The Property

Learn to design the right properties and generators for our example application.

CSV format

CSV is a loose format that nobody really implements the same way. This can be quite confusing even though RFC 4180 tries to provide a simple specification:

  • Each record is on a separate line, separated by CRLF (a \r followed by a \n).

  • The last record of the file may or may not have a CRLF after it. This is optional.

  • The first line of the file may be a header line, ending with a CRLF. The problem description includes a header, so we’ll assume one is always present.

  • Commas go between fields of a record.

  • Spaces are considered part of a field. The example in the problem description doesn’t respect that: it adds a space after each comma even though the space is clearly not meant to be part of the data.

  • Double quotes (") can be used to wrap a given field. Fields that contain line breaks (CRLF), double quotes, or commas must be wrapped in double-quotes.

  • All records in a document contain the same number of fields.

  • A double-quote within a double-quoted field can be escaped by preceding it with another double quote ("a""b" means a"b).

  • Field values or header names can be empty.

  • Valid characters for fields include only the following punctuation (including the space), digits, and letters:

    ! #$%&'()*+-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

This means the official CSV specification won’t let us have employees whose names contain characters outside that set. We can always extend the tests later to allow more, but for now we’ll implement this specification, and as far as our program is concerned, whatever we find in the CSV file will be treated as correct.

For example, if a row contains a, b, c, we’ll consider the three values to be "a", " b", and " c" with the leading spaces, and patch them up in our program, rather than modifying the CSV parser we’ll write. We’ll do this because, in the long run, it’ll be simpler to reason about our system if all independent components are well-defined reusable units, and we instead only need to reason about adapters to glue them together.
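To make the space-handling rule concrete, here is a minimal sketch in Erlang, assuming a row that contains no quoted fields. The module name csv_spaces and both function names are hypothetical, chosen only for illustration; the parser itself has to handle quoting and is not shown here.

```erlang
-module(csv_spaces).
-export([split_row/1, trim_fields/1]).

%% Naive split of one unquoted row on commas. Spaces are kept as-is,
%% following the rule that they belong to the field.
%% ASSUMPTION: the row has no quoted fields, so every comma is a separator.
split_row(Row) ->
    string:split(Row, ",", all).

%% Adapter step that lives outside the parser: strip the leading and
%% trailing spaces the application does not want.
trim_fields(Fields) ->
    [string:trim(Field, both, " ") || Field <- Fields].
```

Calling csv_spaces:split_row("a, b, c") keeps the leading spaces on the second and third fields, and passing that result to csv_spaces:trim_fields/1 removes them. Keeping the two steps separate means the parser stays faithful to the specification while the cleanup remains an application decision.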

Writing tests

Let’s start by writing tests first, so we can think of properties before writing the code.

Selecting the approach

The first step here would be to decide which of the approaches to take in writing the tests. We had a look at these approaches in the Thinking in Properties section. They were:

  • Modeling: Makes a simpler, less efficient version of CSV parsing and compares it to the real one.
  • Generalizing example tests: A standard unit test would dump, read, and check that data matches our expectations. Generalizing makes one property equivalent to all examples.
  • Invariants: Finds a set of rules that, put together, represent CSV operations.
  • Symmetric properties: Serializes and deserializes the data, ensuring the result matches the original input.

The last of these techniques is the most interesting one for parsers and serializers: we need encoded data to validate decoding, and we need decoding to make sure encoding works well. Both sides must agree, so they will be tested together no matter what, and plugging both into a single property tends to be ideal. All we need after that is to anchor the property with either a few traditional unit tests or simpler properties to make sure expectations are met.
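As a rough illustration of a symmetric check, the sketch below encodes and decodes a single field and compares the result with the input. The module csv_roundtrip and its functions are hypothetical stand-ins, not the parser built in this course, and they cover only one field rather than a whole document.

```erlang
-module(csv_roundtrip).
-export([encode_field/1, decode_field/1, roundtrips/1]).

%% Wrap the field in double quotes only when it contains a character
%% that requires it (CR, LF, double quote, or comma).
encode_field(Field) ->
    case lists:any(fun(C) -> lists:member(C, "\r\n\",") end, Field) of
        true  -> [$" | escape(Field)];
        false -> Field
    end.

%% Double every embedded quote and append the closing quote.
escape([])       -> [$"];
escape([$" | T]) -> [$", $" | escape(T)];
escape([C | T])  -> [C | escape(T)].

%% A leading quote marks a quoted field; anything else is returned as-is.
decode_field([$" | Rest]) -> unescape(Rest);
decode_field(Plain)       -> Plain.

%% Collapse doubled quotes and drop the closing quote.
%% ASSUMPTION: the input is well-formed; a missing closing quote crashes.
unescape([$", $" | T]) -> [$" | unescape(T)];
unescape([$"])         -> [];
unescape([C | T])      -> [C | unescape(T)].

%% The symmetric check itself: decoding an encoded field gives it back.
roundtrips(Field) ->
    Field =:= decode_field(encode_field(Field)).
```

A property would assert roundtrips(Field) for every generated Field rather than for a handful of hand-picked values. A few fixed examples, such as checking that the field a"b encodes to "a""b", can then serve as the anchoring unit tests.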

Writing the generators

Since we’ll do an encoding/decoding sequence, the first step should be generating Erlang terms that can be encoded as CSV. CSV contains rows of text fields separated by commas. We’ll start by writing generators for the text fields themselves and assemble them into rows later. For now, we’ll stick to strings, the simplest data type CSV can encode. How we handle integers, dates, and so on tends to be application-specific.

Because CSV is a text-based format, it contains some escapable sequences, which turn out to always be problematic no matter what format we’re handling. In CSV, as we’ve seen in the specification, escape sequences are done through wrapping strings in double quotes, with some special cases for escaping double quotes themselves. For now, let’s not worry about it, besides making sure the case is well-represented in our data generators. Here is what the generators would look like:
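A minimal sketch of such generators, assuming PropEr is available and using its oneof/1, list/1, and elements/1 generators. The module name prop_csv_gen and the function names are illustrative assumptions rather than a definitive listing. The idea is to produce two kinds of strings: ones that never need quoting, and ones that may contain the characters that force it.

```erlang
-module(prop_csv_gen).
%% ASSUMPTION: PropEr is installed and on the code path.
-include_lib("proper/include/proper.hrl").
-export([field/0, unquoted_text/0, quotable_text/0, textdata/0]).

%% A field is either plain text or text that may require quoting.
field() ->
    oneof([unquoted_text(), quotable_text()]).

%% Strings built only from characters that never need escaping.
unquoted_text() ->
    list(elements(textdata())).

%% Same alphabet plus CR, LF, double quote, and comma, so the
%% escaping rules are well represented in the generated data.
quotable_text() ->
    list(elements([$\r, $\n, $", $,] ++ textdata())).

%% The characters the specification allows in unquoted text.
textdata() ->
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
    ":;<=>?@ !#$%&'()*+-./[\\]^_`{|}~".
```

Splitting the generator in two makes it likely that a good share of generated fields exercise the quoting path, instead of leaving that to chance. Sample values can be inspected in the shell with proper_gen:pick(prop_csv_gen:field()).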
