Reversing Double Encoded HTML Entities
Learn and practice how to work with HTML entities using a crawler instance.
Broadly speaking, questions about working with HTML fall into three categories: generating new HTML, manipulating existing HTML, and analyzing HTML-like documents that have other languages embedded in them. The first category is familiar to most of us because the output of most PHP applications is an HTML string, whether produced directly in PHP or through a purpose-built templating engine such as Blade. The last two categories are where things get interesting and nuanced.
An example of manipulating existing HTML might be adding CSS classes to specific elements. The goal here is usually to identify the target elements using CSS’s selector syntax, which we will cover later in this chapter. Another example is reversing the double encoding of HTML entities in generated output, which commonly happens by accident when a content management system builds its HTML through many layers, each of which may encode the text again.
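To make the double-encoding problem concrete, here is a minimal sketch that undoes one extra layer of encoding with a regular expression. It assumes the only damage is an `&amp;` placed in front of something that already looks like an entity (for example, `&amp;lt;` where `&lt;` was intended). The function name `reverseDoubleEncoding` is hypothetical, and the sketch works on a plain string rather than on a crawler instance.

```php
<?php

declare(strict_types=1);

/**
 * Undo one layer of double encoding (illustrative helper, not a library API).
 *
 * Rewrites "&amp;" followed by an entity body and ";" back to a single
 * entity, e.g. "&amp;lt;" becomes "&lt;". A lone "&amp;" is left untouched.
 */
function reverseDoubleEncoding(string $html): string
{
    // Entity body: decimal (#39), hexadecimal (#x27), or named (lt, amp, nbsp).
    $pattern = '/&amp;(#[0-9]+|#x[0-9a-f]+|[a-z][a-z0-9]*);/i';

    // preg_replace() returns null on failure; fall back to the input.
    return preg_replace($pattern, '&$1;', $html) ?? $html;
}

$input = '<p>Fish &amp;amp; chips &amp;lt;3 &amp;#39;quoted&amp;#39;</p>';

echo reverseDoubleEncoding($input), PHP_EOL;
// Prints: <p>Fish &amp; chips &lt;3 &#39;quoted&#39;</p>
```

Because the pattern only rewrites the entity text, the surrounding markup stays intact. One caveat with this approach is that it cannot tell accidental double encoding from intentional encoding: a tutorial page that deliberately displays the literal text `&lt;` would be changed as well, so a pass like this is best limited to content known to be affected.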
The third category, analyzing HTML documents with other embedded languages, is the most interesting of the three. The difficulty is that the embedded language typically makes the overall document invalid HTML. Let’s consider the code below, which is a snippet of Antlers, the templating language for Statamic, a Laravel-based content management system:
<{{ as or 'a' }} class="list of CSS classes">Element content.</{{ as or 'a' }}>
The exact semantics of what is happening in the code above are unimportant, except that we have some embedded code that will dynamically generate an HTML element’s name. These scenarios are relatively straightforward for us to visually analyze but add significant complexity when attempting to analyze the structure of our HTML documents using existing third-party libraries, ...