Web Scraping with Beautiful Soup

Discover the key features and applications of the Beautiful Soup library.

Up to this point, we have acquired the necessary skills to make HTTP requests and retrieve the HTML document from a website. It's time to delve deeper and extract the relevant information from the DOM.

Introduction

Beautiful Soup is a widely used Python library for web scraping and parsing HTML and XML documents. It offers a straightforward and flexible way to navigate and extract data from web pages, making it an indispensable tool for anyone who needs to gather and analyze data from the web. Beautiful Soup can handle various parsing tasks, such as searching for and manipulating tags, attributes, and text within HTML documents. Due to its user-friendly syntax and robust functionality, it has become a preferred choice for developers and data scientists who want to extract and process web data efficiently. In this lesson, we will explore the key features and applications of the Beautiful Soup library.

Note: It is recommended to open the URLs used in this lesson in a separate tab so you can follow the DOM paths that the code traverses.

Installation

We can install the Beautiful Soup library in any Python environment by running the command pip install beautifulsoup4.
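
A quick sanity check that the installation worked is to import the package and print its version:

import bs4

# the bs4 package exposes its version string directly
print(bs4.__version__)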

Usage

Let’s briefly look at using it. The prettify() method produces a Unicode string (Unicode is a standard encoding system used to represent characters from almost all languages), nicely formatted with clear indentation, displaying the HTML in an organized manner.

import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.prettify())

Note: When using Beautiful Soup, it is generally better to pass response.content (raw bytes) rather than response.text, so that Beautiful Soup can handle the decoding itself.
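
As a small illustration of why, Beautiful Soup detects the document's encoding on its own when given raw bytes; the detected value is exposed as original_encoding:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
# passing bytes lets Beautiful Soup pick the encoding
# (e.g. from a <meta charset> tag) instead of trusting requests' guess
soup = BeautifulSoup(response.content, 'html.parser')
print(soup.original_encoding)  # e.g. 'utf-8'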

Once the document is parsed, the output can be handled as a data structure (tree), and we can access its elements like any other Python object attribute.

import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
print("Head tag's children: ", list(soup.head.children), "\n")
print("Page title: ", soup.title.string, "\n")
print("Sample quote: ", soup.find_all("span", {"class": "text"})[0].text, "\n")

Attributes

Since a parsed document is a tree of objects, it is worth reviewing several significant attributes of those objects and the outputs they produce.

(Figure: a sample HTML tree)
  • .tag

    • Returns the first matching element as a Tag object

    • It can be chained to reach a specific tag by following its children

  • .contents vs .children

    • Children of a tag can be found in the .contents list. Instead of retrieving the whole list, we may use the .children generator to iterate through a tag’s children.

  • .descendants

    • Recursively returns all the children and their children (all the sub-HTML trees) of the tag

  • .strings vs .stripped_strings

    • .strings returns all strings in the HTML document, including whitespace characters and strings nested within tags, while .stripped_strings returns only non-empty strings that contain visible text and strips leading and trailing whitespace from each string.

  • .parent vs .parents

    • .parent returns the immediate parent of the current tag, while .parents returns an iterator that allows iterating over all the parents of the current tag.

  • .next_sibling vs .previous_sibling

    • .next_sibling returns the following sibling tag of the current tag, while .previous_sibling returns the previous sibling tag of the current tag.

  • .next_element vs .previous_element

    • .next_element returns the next element in the parse tree after the current element, while .previous_element returns the previous element in the parse tree before the current element.

# .tag
soup.body.div.div.span = <span class='text'>"The world as.."</span>
# .contents
<div class='quote'>.contents = [<span class='text'>, <span class='tags'>,..]
# .descendants
<div class='quote'>.descendants = [<span class='text'>,"the world we have created" ,
<span class= 'tags'>, <a href=> ....]
# .strings
<span class='text'>.strings = [" the world we have created.. "]
# .stripped_strings
<span class='text'>.stripped_strings = ["the world we have created.."]
# .parent
<a href='/tag/deep-thoughts/'>.parent = <span class='tags'>...</span>
# .next_sibling
<span class='text'>.next_sibling = <span class='tags'>...</span>
# .previous_sibling
<span class='tags'>.previous_sibling = <span class='text'>...</span>
# .next_element
<a href='/tag/deep-thoughts/'>.next_element = "deep-thoughts"
# .previous_element
<a href='/tag/deep-thoughts/'>.previous_element = <span class='tags'>...</span>
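
To see these attributes against a live page, here is a small runnable sketch using the Quotes to Scrape homepage (the exact output depends on the page's current markup):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')

first_quote = soup.find("div", {"class": "quote"})
# .children yields the same nodes that .contents stores as a list
print([child.name for child in first_quote.children if child.name])
# .descendants recursively walks the whole sub-tree
print(len(list(first_quote.descendants)))
# .stripped_strings skips whitespace-only strings
print(list(first_quote.stripped_strings)[:3])
# .parent climbs one level up the tree
print(first_quote.parent.name)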

Try it yourself

Explore some of the above attributes using the editor below and Quotes to Scrape:

import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
first_quote_element = soup.find("div", {"class":"quote"})
print(type(first_quote_element))

Searching the DOM

In Beautiful Soup, find_all() is a method that searches the entire parse tree of an HTML or XML document and returns a list of all the matching elements. It is a powerful method that can search for any element in the document based on its tag name, attributes, values, and other criteria. The find() method returns the first matching element, while find_all() returns all of them.
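
Besides a tag name and an attribute dictionary, find_all() accepts other kinds of criteria, for example a regular expression, a list of tag names, a limit on the number of matches, or a string filter (reusing the soup object from the examples above):

import re

# a compiled regex matches tag names: here, tags starting with "s"
print(len(soup.find_all(re.compile("^s"))))
# a list matches any of several tag names
print(len(soup.find_all(["a", "small"])))
# keyword arguments filter on attributes; limit caps the results
print(soup.find_all("a", href=re.compile("/tag/"), limit=3))
# string= searches the text nodes instead of the tags
print(soup.find_all(string=re.compile("world"))[:1])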

Let’s scrape the data from the Quotes to Scrape website using the find_all method:

import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
# returns all the elements with class="quote"
all_quotes_div_elements = soup.find_all("div", {"class":"quote"})
quotes = []
for div in all_quotes_div_elements:
    # find() will always return the first match
    text_span = div.find("span", {"class":"text"})
    quotes.append(text_span.string)
print(quotes[:5])
  • Line 6: We first search for all the <div> elements that contain each quote's information.

  • Lines 8–11: Then we iterate through all of them and, for each one, search for the <span> tag that holds the quote's text. We then extract it using the .string attribute.
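
As an aside, the same extraction can be written with CSS selectors, which Beautiful Soup supports through the select() method (select_one() returns only the first match):

# equivalent to the find_all()/find() combination above
quotes = [span.string for span in soup.select("div.quote span.text")]
print(quotes[:5])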

Try it yourself

Try doing the same in the code below. Scrape all the authors' names from the first page.

import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
all_quotes_div_elements = soup.find_all("div", {"class":"quote"})
# don't remove the list, just append to it
authors = []
# TODO
# Ignore the "by" word just append the string of the tag
print(set(authors))

We have successfully retrieved information from the first page, but our goal is to scrape the entire site. To accomplish this, we need to iterate through all the page URLs and retrieve the quotes from each one.

import requests
from bs4 import BeautifulSoup
# maintain the main URL to use when joining page url
base_url = "https://quotes.toscrape.com"
all_quotes = []
def get_quotes(soup):
    """
    Retrieve the quotes from the soup of the current page.
    """
    all_quotes_div_elements = soup.find_all("div", {"class":"quote"})
    quotes = []
    for div in all_quotes_div_elements:
        text_span = div.find("span", {"class":"text"})
        quotes.append(text_span.string)
    return quotes

def scrape(url):
    """
    Request the URL, get the quotes, find the next-page info, recurse.
    """
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    all_quotes.extend(get_quotes(soup))
    # we got this info after inspecting the next button
    next_page = soup.find("ul", {"class":"pager"}).find("li", {"class":"next"})
    # check if we reached the last page or not
    if next_page:
        # join the main url with the page sub-url
        # ex: "https://quotes.toscrape.com" + "/page/2/"
        next_page_url = requests.compat.urljoin(base_url, next_page.a['href'])
        scrape(next_page_url)
    return
scrape(base_url)
print("Total quotes scraped: ", len(all_quotes))
print(all_quotes[:5])
  • Lines 6–15: We define a function get_quotes() that takes the soup object and scrapes all the quotes' text using the code we built above.

  • Line 25: We then inspect the next-page button and retrieve its element by specifying its path in the DOM.

  • Line 27: The last page won't have a next-page element, so we check whether the next_page variable holds an element or None.

  • Line 30: We extract the next page's URL from the element. However, the URL doesn't contain the domain name, so we use the requests.compat.urljoin() function, which joins the two URL parts together (see the short example after this list).

  • Line 31: Lastly, we call the scrape() function with the next page's URL and repeat the whole process until we reach the site's last page.
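
The urljoin() helper simply resolves a relative path against a base URL; requests.compat.urljoin is a re-export of urllib.parse.urljoin:

from requests.compat import urljoin

print(urljoin("https://quotes.toscrape.com", "/page/2/"))
# https://quotes.toscrape.com/page/2/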

There is an easier way to accomplish the task above: with a simple loop, we can build the page URLs one by one and request each of them in turn. However, implementing the recursive method first helps us understand different approaches that can be useful in more complex scenarios.
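
For reference, here is a minimal sketch of that idea. Since the page count isn't known up front, the loop requests URLs of the form /page/<n>/ (the pattern Quotes to Scrape uses) and stops at the first page with no quotes:

import requests
from bs4 import BeautifulSoup

base_url = "https://quotes.toscrape.com"
all_quotes = []
page = 1
while True:
    response = requests.get(f"{base_url}/page/{page}/")
    soup = BeautifulSoup(response.content, 'html.parser')
    spans = soup.find_all("span", {"class": "text"})
    if not spans:  # a page without quotes means we ran past the last one
        break
    all_quotes.extend(span.string for span in spans)
    page += 1
print("Total quotes scraped:", len(all_quotes))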

Try it yourself

The Quotes to Scrape website displays the top ten tags on the right side. Can you scrape all the URLs for these tags?

import requests
from requests.compat import urljoin
from bs4 import BeautifulSoup
base_url = "https://quotes.toscrape.com/"
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')
# don't remove the list, just append to it
top_ten_tags_URLs = []
# TODO
# don't forget to join the urls with the base_url
print(top_ten_tags_URLs)

Other useful functions

Several other functions come in handy in more complex scenarios:

  • find_parent() / find_parents()

  • find_next_sibling() / find_next_siblings()

  • find_previous_sibling() / find_previous_siblings()

  • find_next() / find_all_next()

  • find_previous() / find_all_previous()

import requests
from bs4 import BeautifulSoup
response = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(response.content, 'html.parser')
# retrieve the "Top ten tags" text
first_tag = soup.find("span", {"class":"tag-item"})
print(first_tag.find_previous_sibling().string)
# extract the "by" word + author name in first quote
first_quote_span = soup.find("div", {"class":"quote"}).find("span", {"class":"text"})
by_word = first_quote_span.find_next_sibling().find_next(string=True)
author_name = soup.find("small", {"class":"author"}).string
print(by_word + author_name)
  • Lines 6–7: We want to extract the "Top ten tags" text. One way to do it is to get the first tag item, Love, and then find its previous sibling using find_previous_sibling(), which returns the <h2> tag that holds the text.

  • Lines 9–12: We want to extract the author's name together with the word "by".

    • Line 9: First, we get the quote's <span class='text'> element by following its path starting from <div class='quote'>.

    • Line 10: The word "by" is the string that immediately follows a plain <span> element, and this <span> is the next sibling of <span class='text'>. Thus, we get that sibling with find_next_sibling() and then call find_next(string=True), where string=True makes strings count as the following elements.

The example above may be more elaborate than necessary for this particular case, but it demonstrates how these functions can be combined to extract any desired information from a page.
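
For instance, find_parent() searches upward instead of downward; here is a minimal sketch that climbs from an author's <small> tag back to its enclosing quote <div> (reusing the soup object from above):

# start at the first author tag and climb to its quote container
author_tag = soup.find("small", {"class": "author"})
quote_div = author_tag.find_parent("div", {"class": "quote"})
print(quote_div.find("span", {"class": "text"}).string)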

Conclusion

This lesson covered searching and navigating the DOM structure and scraping website information. With this knowledge, it is possible to retrieve the desired data from any website by making appropriate requests and utilizing the functions provided by the Beautiful Soup library.