Parsing HTML Attributes and Parameters

Learn how to implement a fragment attribute parser, and practice using it.

Implementing the FragmentAttributeParser class

Parsing the parameters and attributes of our fragments is similar to extracting the HTML fragments themselves. Instead of looking for the less-than and greater-than characters, we use whitespace to determine where each attribute or parameter begins and ends. Because parameter values can legitimately contain whitespace inside strings, we use the same string-skipping technique as before to advance the parser past those problematic areas. Our implementation can be found below:

<?php
class FragmentAttributeParser extends BaseFragmentParser
{
    public function parse($fragment)
    {
        $this->resetState();
        $this->string = new Utf8StringIterator(
            $fragment->innerContent->content
        );

        $tempAttributes = [];
        $attributes = [];

        for ($i = 0; $i < count($this->string); $i++) {
            $this->checkCurrentOffsets($i);

            // Whitespace separates attributes: reset the buffer and skip it.
            if (ctype_space($this->current)) {
                $this->buffer = '';
                continue;
            }

            if ($this->isStartOfString()) {
                // Jump past the string so whitespace inside a value
                // is not mistaken for a boundary.
                $i = $this->scanToEndOfString($i);
                $this->checkCurrentOffsets($i);
            } else {
                $this->buffer .= $this->current;
            }

            // At the end of the input, or just before whitespace,
            // record the buffered content and its end offset.
            if ($this->next == null || ctype_space($this->next)) {
                $tempAttributes[] = [
                    $this->buffer, $i
                ];
                $this->buffer = '';
                continue;
            }
        }

        foreach ($tempAttributes as $tempAttribute) {
            $attribute = new FragmentAttribute();
            $attribute->content = $tempAttribute[0];

            // Calculate the attribute's start and end
            // positions relative to the original doc.
            $attribute->endPosition = $tempAttribute[1] +
                $fragment->innerContent->startPosition;
            $attribute->startPosition = $attribute->endPosition -
                str($attribute->content)->length() + 1;

            // Extract name/values, if present.
            $parts = str($attribute->content)->explode('=', 2);

            if ($parts->count() == 2) {
                $attribute->type = AttributeType::Parameter;
                $attribute->name = $parts->first();
                $attribute->value = $parts->last();
            } else {
                $attribute->name = $attribute->content;
                $attribute->type = AttributeType::Attribute;
            }

            $attributes[] = $attribute;
        }

        return $attributes;
    }
}
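
The second loop in the parse method turns each recorded entry into a positioned attribute or parameter. The arithmetic is easier to follow in isolation, so here is a minimal sketch of that step using plain PHP functions. The function name describeToken, its array return shape, and the sample offsets are assumptions made for illustration; the lesson's own code uses FragmentAttribute objects and the str() helper instead.

```php
<?php
// Hypothetical helper: mirrors the position math and name/value split
// from the second loop, but returns a plain array.
function describeToken(string $content, int $endIndex, int $innerStart): array
{
    // End position relative to the original document.
    $end = $endIndex + $innerStart;
    // Walk back by the content length; +1 because both ends are inclusive.
    $start = $end - mb_strlen($content) + 1;

    // Split on the first "=" only, so values may contain "=" themselves.
    $parts = explode('=', $content, 2);

    if (count($parts) === 2) {
        return [
            'type' => 'parameter',
            'name' => $parts[0],
            'value' => $parts[1],
            'start' => $start,
            'end' => $end,
        ];
    }

    return [
        'type' => 'attribute',
        'name' => $content,
        'value' => null,
        'start' => $start,
        'end' => $end,
    ];
}

// Assume the inner content "disabled data-id=5" begins at document offset 10.
$first = describeToken('disabled', 7, 10);
$second = describeToken('data-id=5', 17, 10);
```

With these assumed inputs, the first entry spans positions 10 to 17 and the second spans 19 to 27. Limiting the split to two parts means a value such as "a?b=c" stays intact.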

As with our fragment parser, much of what happens in the parse method is similar to what we’ve seen before. The most notable difference is the whitespace check at the top of the loop: if the current character is whitespace, we clear the contents of our internal buffer and move to the next position within the string. This ensures that our parsed attributes and parameters do not have extraneous whitespace on either side, and it gives us a simple way to determine boundaries. The check at the bottom of the loop, which runs when the next character is whitespace or we have reached the end of the input, is where we create a list of temporary attributes.
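
The scanning loop described above can be tried outside the class with a small standalone sketch. It is not the lesson's implementation: the function name scanTokens is hypothetical, mb_str_split (PHP 7.4+ with mbstring) stands in for Utf8StringIterator, and the inline quote handling is only an assumption about what isStartOfString and scanToEndOfString do. Escaped quotes are not handled.

```php
<?php
// Hypothetical standalone version of the scanning loop.
// Returns a list of [content, endIndex] pairs.
function scanTokens(string $input): array
{
    $chars = mb_str_split($input); // one entry per UTF-8 character
    $count = count($chars);
    $tokens = [];
    $buffer = '';

    for ($i = 0; $i < $count; $i++) {
        $current = $chars[$i];

        // Whitespace outside a string is a boundary: reset and skip.
        if (ctype_space($current)) {
            $buffer = '';
            continue;
        }

        if ($current === '"' || $current === "'") {
            // Copy everything up to the matching quote, so whitespace
            // inside the string never reaches the boundary check.
            $buffer .= $current;
            for ($i++; $i < $count; $i++) {
                $buffer .= $chars[$i];
                if ($chars[$i] === $current) {
                    break;
                }
            }
            // Keep the index in range if the string was never closed.
            $i = min($i, $count - 1);
        } else {
            $buffer .= $current;
        }

        // End of input, or whitespace ahead: record the token.
        $next = $chars[$i + 1] ?? null;
        if ($next === null || ctype_space($next)) {
            $tokens[] = [$buffer, $i];
            $buffer = '';
        }
    }

    return $tokens;
}

$tokens = scanTokens('class="a b" disabled');
```

For this input, the sketch records two entries: the parameter class="a b" ending at index 10 and the attribute disabled ending at index 19. The space inside the quoted value does not split the first entry.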

Our list of temporary attributes contains the buffer’s content at that time and where the position within the HTML ...