Introduction to lxml
Learn how to scrape and navigate the HTML DOM using XPath.
Now that we have covered XPath, it's time to put that knowledge into practice by extracting data from static and dynamic websites.
lxml
Beautiful Soup has no built-in support for XPath, but another library does: lxml, a widely used Python library for web scraping. Although its primary focus is parsing XML, it also handles HTML. lxml lets us query a document with both XPath and CSS selectors (the latter through the separate cssselect package), which makes it a versatile tool for data extraction and a solid alternative to Beautiful Soup.
Usage
Let's take a look at how we can use it.
import requests
from lxml import html

response = requests.get("https://books.toscrape.com/")
DOM = html.fromstring(response.content)
print(DOM.cssselect("title")[0].text)
print(DOM.xpath("//h3/a/@title")[:10])
The code is quite similar to Beautiful Soup parsing code:
Line 5: We begin by requesting the URL, and then we pass the content to the parser, in this case lxml.html. This step constructs the familiar DOM tree, allowing us to navigate it as we normally would. ...
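To illustrate what navigating the tree can look like without a network request, here is a small sketch that parses an inline HTML string and walks it. The markup is invented for the example and only stands in for a downloaded page; findtext(), xpath(), getparent(), and get() are ordinary lxml element methods.

```python
from lxml import html

# Invented markup standing in for a downloaded page.
PAGE = """
<html>
  <head><title>Demo shop</title></head>
  <body>
    <article class="product"><h3><a href="/p/1" title="First book">First...</a></h3></article>
    <article class="product"><h3><a href="/p/2" title="Second book">Second...</a></h3></article>
  </body>
</html>
"""

dom = html.fromstring(PAGE)

# findtext() returns the text of the first element matching a simple path.
page_title = dom.findtext(".//title")

# From each link, read its attributes and text, then climb to the enclosing article.
links = []
for a in dom.xpath("//article/h3/a"):
    article = a.getparent().getparent()  # a -> h3 -> article
    links.append((a.get("href"), a.get("title"), a.text, article.get("class")))

print(page_title)
for link in links:
    print(link)
```

Parsing from a string like this is also a convenient way to try out XPath expressions before pointing them at a live site.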