Introduction to lxml

Learn how to scrape and navigate the HTML DOM using XPath.

Now that we have covered XPath, it's time to put that knowledge into practice and use it to extract data from static and dynamic websites.

lxml

Beautiful Soup on its own has no built-in support for XPath, so we turn to another library: lxml. lxml is a Python library built on the C libraries libxml2 and libxslt. Its primary focus is parsing XML, but it also parses HTML. It supports XPath natively and CSS selectors through the separate cssselect package, which makes it a versatile tool for data extraction and a capable alternative to Beautiful Soup.
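To make the two selector styles concrete, here is a minimal sketch that runs both against the same markup. The inline page and the names PAGE, tree, xpath_titles, and css_titles are made up for illustration, and the CSS branch assumes the cssselect package may or may not be installed.

```python
from lxml import html

# Hypothetical inline page, used so the example runs without network access.
PAGE = """
<html>
  <head><title>Sample Shop</title></head>
  <body>
    <h3><a href="/a" title="Book A">Book A</a></h3>
    <h3><a href="/b" title="Book B">Book B</a></h3>
  </body>
</html>
"""

tree = html.fromstring(PAGE)

# XPath works out of the box: select the title attribute of each link.
xpath_titles = tree.xpath("//h3/a/@title")

# CSS selectors rely on the separate cssselect package being installed.
try:
    css_titles = [a.get("title") for a in tree.cssselect("h3 > a")]
except ImportError:
    css_titles = None  # cssselect is not available in this environment

print(xpath_titles)
print(css_titles)
```

When cssselect is available, both queries should select the same two links. The XPath version returns the attribute values directly, while the CSS version returns elements from which we read the attribute ourselves.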

Usage

Let's take a look at how we can use it.

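import requests
from lxml import html
response = requests.get("https://books.toscrape.com/")  # download the page
DOM = html.fromstring(response.content)  # parse the raw bytes into an element tree
print(DOM.cssselect("title")[0].text)  # CSS selector; needs the cssselect package
print(DOM.xpath("//h3/a/@title")[:10])  # XPath: title attribute of the first 10 book links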

The code is quite similar to Beautiful Soup parsing code:

  • Lines 3-4: We begin by requesting the URL, and then we pass the response content to the parser, in this case, lxml.html's fromstring() function.

    • This step constructs the familiar DOM tree, allowing us to navigate it as we normally would. ...
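As a rough illustration of navigating the parsed tree, the sketch below walks a small, made-up document. The markup and the variable names are assumptions for the example; the calls themselves (get_element_by_id, getparent, getnext, iterchildren) are part of lxml's element API.

```python
from lxml import html

# Hypothetical markup standing in for a downloaded page.
tree = html.fromstring(
    "<html><body><ul id='books'>"
    "<li class='book'>Book A</li>"
    "<li class='book'>Book B</li>"
    "</ul></body></html>"
)

book_list = tree.get_element_by_id("books")  # look up an element by its id
first = book_list[0]                         # children are indexable like a list

print(first.tag, first.text)                 # tag name and text content
print(first.getparent().tag)                 # step up to the parent element
print(first.getnext().text)                  # step across to the next sibling

# Walk all direct children of the list.
all_books = [li.text for li in book_list.iterchildren()]
print(all_books)
```

Elements behave much like Python lists of their children, so indexing, iteration, and len() work alongside the explicit navigation methods.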