CSV Parsing: The Property
Learn to design the right properties and generators for our example application.
CSV format
CSV is a loose format that nobody really implements the same way. This can be quite confusing even though RFC 4180 tries to provide a simple specification:
- Each record is on a separate line, separated by CRLF (a `\r` followed by a `\n`).
- The last record of the file may or may not have a CRLF after it; it's optional.
- The first line of the file may be a header line, ending with a CRLF. In this case, the problem description includes a header, which will be assumed to always be there.
- Commas go between fields of a record.
- Any spaces are considered to be part of the record. (The example in the problem description doesn't respect that, as it adds a space after each comma even though it's clearly not part of the record.)
- Double quotes (`"`) can be used to wrap a given field. Fields that contain line breaks (CRLF), double quotes, or commas must be wrapped in double quotes.
- All records in a document contain the same number of fields.
- A double quote within a double-quoted field can be escaped by preceding it with another double quote (`"a""b"` means `a"b`).
- Field values or header names can be empty.
- Valid characters for records include only `` ! #$%&'()*+-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ``.
This means the official CSV specs won't let us have employees whose names don't fit that pattern. We can always extend the tests later to support custom needs, but for now we'll implement this specification, and as far as our program is concerned, whatever we find in the CSV file will be treated as correct.
For example, if a row contains `a, b, c`, we'll consider the three values to be `"a"`, `" b"`, and `" c"` with the leading spaces, and patch them up in our program rather than modifying the CSV parser we'll write. We'll do this because, in the long run, it'll be simpler to reason about our system if all independent components are well-defined, reusable units, and we instead only need to reason about the adapters that glue them together.
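To make those rules concrete, here is a sketch of raw lines and the fields a conforming parser should extract. The `parse_line/1` helper is purely hypothetical, used only to state the expectations implied by the rules above:

# parse_line/1 is hypothetical, shown only to express expected results
# that follow from the specification above.
parse_line(~s(a,b,c))     #=> ["a", "b", "c"]
parse_line(~s(a, b, c))   #=> ["a", " b", " c"]   (spaces belong to the fields)
parse_line(~s("a""b",c))  #=> [~s(a"b), "c"]      ("" escapes a double quote)
parse_line(~s("a,b",c))   #=> ["a,b", "c"]        (quoted commas are literal)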
Writing tests
Let’s start by writing tests first, so we can think of properties before writing the code.
Selecting the approach
The first step is to decide which of the approaches to take in writing the tests. We had a look at these approaches in the Thinking in Properties section. They were:
- Modeling: Make a simpler, less efficient version of CSV parsing and compare it to the real one.
- Generalizing example tests: A standard unit test would dump, read, and check that the data matches our expectations; generalizing turns one test into an equivalent of all such examples.
- Invariants: Find a set of rules that, put together, represent CSV operations.
- Symmetric properties: Serialize and then deserialize the data, ensuring the result matches the original.
The last technique is the most interesting one for parsers and serializers: we need encoded data to validate decoding, and decoding is required to make sure encoding works well. Both sides need to agree and be tested together no matter what, so plugging both into a single property tends to be ideal. All we need after that is to anchor the property with either a few traditional unit tests or simpler properties to make sure expectations are met.
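As a sketch of where that leads, the symmetric property could look like the following. `Csv.encode/1` and `Csv.decode/1` are names we're assuming for the code to come, and `csv_source/0` is the generator we'll write next:

defmodule CsvTest do
  use ExUnit.Case
  use PropCheck

  # Round-trip: decoding what we encode should hand back the original data.
  # Csv.encode/1 and Csv.decode/1 are assumed names for code we'll write
  # later; csv_source/0 is the generator we build below.
  property "roundtrip encoding/decoding" do
    forall maps <- csv_source() do
      maps == Csv.decode(Csv.encode(maps))
    end
  end
end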
Writing the generators
Since we'll do an encoding/decoding sequence, the first step is to generate Elixir terms that can be encoded to CSV. CSV contains rows of text records separated by commas. We'll start by writing generators for the text records themselves and assemble them later. For now, we'll stick to strings, the simplest CSV encoding possible; how we handle integers, dates, and so on tends to be application-specific.
Because CSV is a text-based format, it contains some escapable sequences, which turn out to be problematic no matter what format we're handling. In CSV, as we've seen in the specification, escaping is done by wrapping strings in double quotes, with some special cases for escaping double quotes themselves. For now, let's not worry about escaping itself, beyond making sure the case is well represented in our data generators. Here is what the generators would look like:
def field() do
  oneof([unquoted_text(), quotable_text()])
end

# using charlists for the easy generation
def unquoted_text() do
  let chars <- list(elements(textdata())) do
    to_string(chars)
  end
end

def quotable_text() do
  let chars <- list(elements('\r\n",' ++ textdata())) do
    to_string(chars)
  end
end

def textdata() do
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' ++
    ':;<=>?@ !#$%&\'()*+-./[\\]^_`{|}~'
end
The `field()` generator depends on two other generators, `unquoted_text()` and `quotable_text()`. The former will be used to generate Elixir data that requires no known escape sequence once converted, whereas the latter will be used to generate sequences that may require escaping. Both generators rely on `textdata()`, which contains all the valid characters allowed by the specification.
Note that we've made an Elixir charlist for `textdata()` with alphanumeric characters coming first and that we pass it to `list(elements())`. This approach will randomly pick characters from `textdata()` to create a string. If one of our tests fails, `elements()` shrinks toward the first elements of the list we pass to it. PropEr will then try to report counterexamples that are more easily human-readable, when possible, by limiting the number of special characters they contain. Rather than `{#$%a~`, it might report `ABFe#c` once a test fails.
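To see that bias in action, we could write a throwaway property that fails on purpose and watch what PropEr reports. This is a sketch for experimentation only, not part of our suite:

# Intentionally failing property, only to observe shrinking: reported
# counterexamples will favor the alphanumeric characters that come
# first in textdata().
property "demo: shrinking bias (fails on purpose)" do
  forall s <- unquoted_text() do
    String.length(s) < 3
  end
end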
We can now put these records together. A CSV file will have two types of rows:
- A header on the first line.
- Data entries in the following lines.
In any CSV document, we expect the number of columns to be the same on all of the rows.
def header(size) do
  vector(size, name())
end

def record(size) do
  vector(size, field())
end

def name() do
  field()
end
These generators generate the same types of strings for both headers and rows, with a known fixed length passed as an argument. The `name()` generator is defined as `field()` because they have the same requirements specification-wise, but it's useful to give each generator a name according to its purpose. If we end up modifying or changing the requirements on one of them, we can do so with minimal changes. We can then assemble everything together into one list of maps that contains all the data we need, like this:
def csv_source() do
  let size <- pos_integer() do
    let keys <- header(size + 1) do
      list(entry(size + 1, keys))
    end
  end
end

def entry(size, keys) do
  let vals <- record(size) do
    Map.new(Enum.zip(keys, vals))
  end
end
The `csv_source()` generator picks a `size` value that represents how many entries will be in each row. By putting it in a `let` macro, we ensure that whatever expression uses `size` sees a discrete value, not the generator itself. This allows us to safely use `size` multiple times, always with the same value, in the second `let` macro. That second macro generates one set of headers (the keys of every map) and then uses them to create a list of entries.
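To see why binding `size` first matters, consider a version that skips the outer `let`. This is a hypothetical, broken sketch for illustration only:

# Broken on purpose: vector/2 needs a concrete integer, so passing
# pos_integer() directly wouldn't work, and even if it did, the two
# draws below would be independent, letting the header and the records
# disagree on the number of columns.
def bad_csv_source() do
  let keys <- header(pos_integer()) do  # one random size here...
    list(entry(pos_integer(), keys))    # ...another, unrelated one here
  end
end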
The entries themselves are specified by the `entry()` generator, which creates a list of record values and pairs them up with the keys from `csv_source()` into a map. Let's take a look at what the generated values would look like.
Note: Running the generator at this point would be an instant failure since we haven’t written the code to go with it.
The commands to run in the shell are:
ExUnit.start()
c "test/csv_test.exs"
:proper_gen.sample(CsvTest.csv_source())
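Once the accompanying code exists, each sampled value should be a list of maps that all share the generated header strings as keys. A hedged illustration of the shape only; the actual strings are random and differ on every run:

# Shape only, not real output: every map shares the same header keys.
[
  %{"AB c" => "x,y", "" => ~s(j"k)},
  %{"AB c" => "", "" => "0 \r\n 1"}
]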