How to parse a web page in PHP

Web page parsing, also known as web scraping, is a technique for extracting structured data from an HTML document. It is closely related to web crawling, which discovers pages by following links. Extracting specific information by parsing a web page is a valuable skill and a common requirement in web development. Web scraping requires navigating the Document Object Model (DOM), the hierarchical structure of an HTML document. We usually scrape a website when it does not offer its data through an API or in a downloadable form.

Use cases for web page parsing

Web page parsing has many use cases. Some of the most common are listed below:

  • Lead generation: Companies use web crawling to extract contact information from forums or social media websites to generate leads and find potential customers.

  • Content/news aggregation: News aggregator services such as Google News use web crawling to collect articles and posts from around the world and show them to their users.

  • Price monitoring: E-commerce websites use web crawling to parse their competitors’ websites and extract the prices of different products. Then, they use this information to offer more competitive pricing to their customers.

  • Search engine indexing: Search engines like Google and Bing parse billions of web pages daily to index new pages so that they can answer search queries.

  • Weather forecasting: Climate researchers parse the web pages of different weather forecasting providers to monitor climate patterns.

The process of parsing a web page

A typical flow to scrape a web page is as follows:

  • Making an HTTP request: The first step is to send an HTTP request to retrieve the target web page as HTML.

  • Parsing HTML: The next step is to parse the HTML into a DOM and navigate through its elements to reach the part of the page containing the required data.

  • Extracting data: After reaching that part of the page, it’s time to extract the data. The data is typically text, retrieved from the text content of the elements or from their attributes.

  • Cleaning data (optional): Once we have the required data, we might need to clean it. For example, we might need to split the text with a particular delimiter.

  • Displaying or saving data: When we have the required data in the desired format, we can display it somewhere or save it permanently.
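The steps above can be sketched end to end in a few lines. The following is a minimal sketch, not a definitive implementation: it uses PHP's built-in DOMDocument and DOMXPath classes instead of a third-party package, and it replaces the HTTP request with a hard-coded HTML string so that it runs without network access. The sample markup, the author names, and the variable names are assumptions made for illustration.

```php
<?php
// Step 1 (stand-in): in a real script, this string would come from an HTTP request.
$html = <<<'HTML'
<html><body>
<div class="quote">
  <span class="text">&ldquo;First sample quote.&rdquo;</span>
  <span>by <small class="author">  Jane Doe </small></span>
</div>
<div class="quote">
  <span class="text">&ldquo;Second sample quote.&rdquo;</span>
  <span>by <small class="author">John Roe</small></span>
</div>
</body></html>
HTML;

// Step 2: parse the HTML into a DOM tree.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Steps 3 and 4: extract the text of each element, then clean it.
$quotes = [];
foreach ($xpath->query('//div[@class="quote"]') as $quoteNode) {
    // item(0) assumes the element exists, which holds for this sample markup.
    $text = $xpath->query('.//span[@class="text"]', $quoteNode)->item(0)->textContent;
    $author = $xpath->query('.//small[@class="author"]', $quoteNode)->item(0)->textContent;
    $quotes[] = [
        // Strip surrounding whitespace and curly quotation marks (Unicode-aware).
        'text' => preg_replace('/^[\s\x{201C}\x{201D}]+|[\s\x{201C}\x{201D}]+$/u', '', $text),
        'author' => trim($author),
    ];
}

// Step 5: display the data, here as JSON that could also be written to a file.
echo json_encode($quotes, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE), PHP_EOL;
```

Keeping each step separate makes it easy to swap one part later, for example replacing the hard-coded string with a real HTTP request.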

Using Symfony DomCrawler to parse a quotes website

While many packages are available to parse a web page in PHP, Symfony DomCrawler provides an easy and convenient way to traverse the DOM. To start with web scraping, let’s extract the quotes and their authors listed on the website quotes.toscrape.com using Symfony DomCrawler.

Installing Symfony DomCrawler

Let’s install Symfony DomCrawler with Composer, a dependency management tool for PHP:

composer require symfony/dom-crawler
Installing Symfony DomCrawler using Composer

Analyzing the DOM

Before we can parse a web page, we must analyze its DOM to see which page elements we need to parse to extract our required data. Let’s explore the DOM of the quotes website:

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">“Contents of the quote...”</span>
    <span>by <small class="author" itemprop="author">Author of the quote...</small>
        <a href="/author/author-name">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Since the web page has structured data, all the quotes will follow the same HTML structure.

We can see that the quote text is wrapped in a span tag with the text class, and the author’s name is wrapped in a small tag with the author class, nested inside a second span.
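Besides reading the markup by eye, we can check which tag carries which class programmatically. The following is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath classes (not Symfony DomCrawler) on a shortened copy of the markup above. The variable name $classToTag is an assumption made for illustration.

```php
<?php
// A shortened copy of the quote markup shown above.
$html = <<<'HTML'
<div class="quote">
  <span class="text">Contents of the quote...</span>
  <span>by <small class="author">Author of the quote...</small>
    <a href="/author/author-name">(about)</a>
  </span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
  </div>
</div>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Map each class name to the name of the tag that carries it.
$classToTag = [];
foreach ($xpath->query('//*[@class]') as $element) {
    $classToTag[$element->getAttribute('class')] = $element->nodeName;
}

print_r($classToTag);
```

The resulting map tells us which tag names to use in our queries: span for the text class and small for the author class.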

Using XPath to extract the data

The XML Path Language (XPath) is a query language for selecting nodes in an XML or HTML document. It offers a concise way to target the required elements in the DOM. For example, to select all the div elements whose class attribute is quote, we can use the following syntax:

//div[@class='quote']

In the example above:

  • // searches the whole document, at any depth, rather than only the top level.

  • div restricts the match to div elements.

  • [@class='quote'] keeps only the elements whose class attribute is exactly equal to the value quote.
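To see how each part of the expression affects the result, we can run a few variations against a small document. This is a minimal sketch using PHP's built-in DOMXPath class. The sample markup, including the extra featured class and the p element, is an assumption made for illustration and does not come from the quotes website.

```php
<?php
$html = <<<'HTML'
<div class="quote"><span class="text">A</span></div>
<div class="quote featured"><span class="text">B</span></div>
<p class="quote">C</p>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact match: only the div whose class attribute is exactly "quote".
$exact = $xpath->query("//div[@class='quote']");

// Replacing div with * matches any element whose class is exactly "quote".
$anyTag = $xpath->query("//*[@class='quote']");

// Token match: divs whose space-separated class list contains "quote".
$token = $xpath->query(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' quote ')]"
);

echo $exact->length, ' ', $anyTag->length, ' ', $token->length, PHP_EOL; // prints "1 2 2"
```

The exact comparison misses the second div because its class attribute is "quote featured". When elements may carry several classes, the token match is the safer choice.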

Extracting quotes

Let’s extract all the quotes from the quotes website and display the quote text and author name.

<?php

require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;

// Fetch a URL with cURL and return the response body; throw if the request fails.
function fetchHTML($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    $error = curl_error($ch);
    curl_close($ch);
    if ($html === false) {
        throw new RuntimeException("Request failed: $error");
    }
    return $html;
}

$url = "https://quotes.toscrape.com/";
$html = fetchHTML($url);

$crawler = new Crawler($html);

// Collect one ['text' => ..., 'author' => ...] array per quote.
$quotes = $crawler->filterXPath('//div[@class="quote"]')->each(function (Crawler $node, $i) {
    $text = $node->filterXPath('.//span[@class="text"]')->text();
    $author = $node->filterXPath('.//small[@class="author"]')->text();
    return compact('text', 'author');
});

// Escape the scraped text before echoing it into an HTML page.
foreach ($quotes as $quote) {
    echo "Quote: " . htmlspecialchars($quote['text']) . "<br>";
    echo "Author: " . htmlspecialchars($quote['author']) . "<br>";
    echo "<br><br>";
}
Extracting quotes, text, and author name

Code explanation

  • Lines 3–4: We load Composer's autoloader and import the Symfony DomCrawler Crawler class.

  • Lines 6–17: We implement a function that takes a URL as input and returns its HTML content as output. If the request fails, it throws an exception instead of returning false.

  • Lines 19–20: We pass the URL to fetchHTML and get its HTML content.

  • Line 22: We create a new crawler instance by passing the HTML.

  • Line 25: We get and loop over all the div elements with the attribute class="quote".

  • Line 26: We get the quote text by extracting the content of the span element with the attribute class="text". The leading . in the XPath expression keeps the search inside the current quote.

  • Line 27: We get the quote author by extracting the content of the small element with the attribute class="author".

  • Line 28: We return the text and author of the current quote as an associative array. The each method collects these arrays into $quotes.

  • Lines 31–36: We display all the quotes and their authors on the screen, escaping the scraped text with htmlspecialchars.

Conclusion

Parsing web pages with PHP is an effective way to extract meaningful data. It works even when a website provides no API or other official means of access. However, we should keep a few factors in mind while scraping a web page, such as respecting the website’s terms of service and implementing request throttling.
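Request throttling can be as simple as pausing between consecutive requests. The following is a minimal sketch under stated assumptions: the function name fetchAllThrottled, its parameters, the page URLs, and the stand-in fetcher are all hypothetical and chosen for illustration. In a real script, the callable could be a function that performs an actual HTTP request.

```php
<?php
/**
 * Fetch each URL in turn, pausing between requests.
 * $fetch is any callable that takes a URL and returns its HTML.
 */
function fetchAllThrottled(array $urls, callable $fetch, float $delaySeconds = 1.0): array
{
    $pages = [];
    foreach (array_values($urls) as $index => $url) {
        if ($index > 0) {
            // Wait before every request except the first.
            usleep((int) round($delaySeconds * 1000000));
        }
        $pages[$url] = $fetch($url);
    }
    return $pages;
}

// A stand-in fetcher so that the sketch runs without network access.
$fakeFetch = function (string $url): string {
    return "<html><body>Page for $url</body></html>";
};

$start = microtime(true);
$pages = fetchAllThrottled(
    ['https://quotes.toscrape.com/page/1/', 'https://quotes.toscrape.com/page/2/'],
    $fakeFetch,
    0.2
);
$elapsed = microtime(true) - $start;

printf("Fetched %d pages in %.1f seconds\n", count($pages), $elapsed);
```

A fixed delay is the simplest policy. A suitable value depends on the target website, so we should check its terms of service and robots.txt file before choosing one.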


Copyright ©2025 Educative, Inc. All rights reserved