Modeling Gherkin

Learn how to write a Gherkin parser using Gherkin keywords.

Applied techniques: Writing a Gherkin parser

Gherkin is an indentation-based language that lets developers write software tests that read like a natural language, such as English or French. We will not explain how to use Gherkin to run tests. Instead, we will explore the structure of the language and write a parser for it in PHP. Although we will not discuss in depth how tests written in Gherkin are eventually used, we need to look at several examples of the language to get a sense of what we are dealing with before we start writing our parser. Let’s take a look at the following example from the Gherkin reference:

Feature: Guess the word

  # The first example has two steps
  Scenario: Maker starts a game
    When the Maker starts a game
    Then the Maker waits for a Breaker to join

  # The second example has three steps
  Scenario: Breaker joins a game
    Given the Maker has started a game with the word "silky"
    When the Breaker joins the Maker's game
    Then the Breaker must guess a word with 5 characters

We can already see a few notable things in the code snippet above before we get into our parser implementation. The first is that a Gherkin file begins with a Feature block, which can contain multiple children. In this example, the feature contains two blocks of another type: scenarios.

Gherkin uses a scenario to express a testable behavior within the feature test. Every Gherkin scenario belongs to a feature, and a scenario can have many steps.
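The feature, scenario, and step relationship described above can be sketched as a few small PHP classes. This is only a minimal illustration, not the parser's final design: the class names `Feature`, `Scenario`, and `Step`, and their properties, are assumptions made for this example, and the code assumes PHP 8.0 or later for constructor property promotion.

```php
<?php
declare(strict_types=1);

// Hypothetical node classes; the names are illustrative and not taken from any library.
final class Step
{
    public function __construct(
        public string $keyword, // for example "Given", "When", or "Then"
        public string $text,
    ) {}
}

final class Scenario
{
    /** @var Step[] A scenario can have many steps. */
    public array $steps = [];

    public function __construct(public string $name) {}
}

final class Feature
{
    /** @var Scenario[] Every scenario belongs to a feature. */
    public array $scenarios = [];

    public function __construct(public string $name) {}
}

// Build the first scenario of the example by hand.
$feature = new Feature('Guess the word');
$scenario = new Scenario('Maker starts a game');
$scenario->steps[] = new Step('When', 'the Maker starts a game');
$scenario->steps[] = new Step('Then', 'the Maker waits for a Breaker to join');
$feature->scenarios[] = $scenario;

// Number of steps in the first scenario.
echo count($feature->scenarios[0]->steps), PHP_EOL;
```

Keeping the children in plain arrays mirrors the "one feature, many scenarios, many steps" wording and leaves room to add other node types later.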

Gherkin’s parent-child relationships are indicated by the indentation of the file. In the code snippet above, we have the following program structure:

  1. Feature
    1. Scenario
      1. Step
      2. Step
    2. Scenario
      1. Step
      2. Step
      3. Step
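To make the link between indentation and nesting concrete, the following sketch turns leading spaces into a depth number for each line. It is a rough illustration rather than the parser we will build: the function name `outlineByIndentation` is hypothetical, and the code assumes two spaces per indentation level and does not handle tabs.

```php
<?php
declare(strict_types=1);

/**
 * Returns a list of [depth, trimmed line] pairs for every non-blank line.
 * Assumption for this sketch: each nesting level is indented by $indentWidth spaces.
 */
function outlineByIndentation(string $source, int $indentWidth = 2): array
{
    $outline = [];
    foreach (preg_split('/\R/', $source) as $line) {
        if (trim($line) === '') {
            continue; // blank lines carry no structure
        }
        // Count leading spaces only; tabs are not handled here.
        $leading = strlen($line) - strlen(ltrim($line, ' '));
        $outline[] = [intdiv($leading, $indentWidth), trim($line)];
    }
    return $outline;
}

$source = <<<GHERKIN
Feature: Guess the word

  Scenario: Maker starts a game
    When the Maker starts a game
    Then the Maker waits for a Breaker to join
GHERKIN;

// Print each line prefixed with its depth, indented to match.
foreach (outlineByIndentation($source) as [$depth, $text]) {
    echo str_repeat('  ', $depth), $depth, ': ', $text, PHP_EOL;
}
```

A depth of 0 corresponds to the feature, 1 to a scenario, and 2 to a step, which matches the outline above.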

Lines 2 and 7 of the snippet are empty and can largely be ignored for our purposes. Lines 3 and 8 contain an interesting construct we need to consider: the Gherkin comment.

The Gherkin comment

Gherkin comments can appear on any line, with any amount of leading whitespace, but will always start with the # character. These few facts will make it relatively painless for us to parse comments later.
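Given these rules, recognizing a comment line could look something like the sketch below. The helper name `isGherkinComment` is an assumption made for this example. It only checks whether the first non-whitespace character of a line is `#`.

```php
<?php
declare(strict_types=1);

/** Returns true when the first non-whitespace character of the line is "#". */
function isGherkinComment(string $line): bool
{
    return preg_match('/^\s*#/', $line) === 1;
}

var_dump(isGherkinComment('  # The first example has two steps')); // bool(true)
var_dump(isGherkinComment('Scenario: Maker starts a game'));       // bool(false)
var_dump(isGherkinComment('When the user types "#"'));             // bool(false)
```

Because the pattern is anchored at the start of the line, a `#` that appears later in a step is not treated as a comment.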

We can update our mental program structure to:

  1. Feature
    1. Comment
    2. Scenario
      1. Step
...