The internet is arguably the most abundant data source that you can access today. Crawling through this massive web of information on your own would take a superhuman amount of effort. So, why not build a web scraper to do the detective work for you? Automated web scraping is a great way to collect relevant data across many webpages in a relatively short amount of time.
You may be wondering why we chose Python for this tutorial, and the short answer is that Python is considered one of the best programming languages to use for web scraping. Python libraries like BeautifulSoup and packages like Selenium have made it incredibly easy to get started with your own web scraping project.
We’ll introduce you to some basic principles and applications of web scraping. Then, we’ll take a closer look at some of the more popular Python tools and libraries used for web scraping before moving on to a quick step-by-step tutorial for building your very own web scraper. Let’s get started!
Try one of our 300+ courses and learning paths: Predictive Data Analysis with Python.
As a high-level, interpreted language, Python 3 is one of the easiest languages to read and write because its syntax bears some similarities to the English language. Luckily for us, Python is much easier to learn than English. Python programming is also a great choice in general for anyone who wants to dabble in data science, artificial intelligence, machine learning, web applications, image processing, or operating systems.
This section will cover what Python web scraping is, what it can be used for, how it works, and the tools you can use to scrape data.
Web scraping is the process of extracting usable data from different webpages to be used for analysis, comparison, and many other purposes. The type of data that can be collected includes text, images, ratings, URLs, and more. Web scrapers extract this data by loading a URL and reading the HTML code for that page. Advanced web scrapers are capable of extracting CSS and JavaScript code from the webpage as well.
Believe it or not, web scraping used to be conducted manually by copying and pasting data from webpages into text files and spreadsheets!
Legality
As long as the data you’re scraping does not require an account for access, isn’t blocked by a robots.txt file, and is publicly available, it’s generally considered fair game.
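If you want to check programmatically whether a page is off-limits, Python’s built-in urllib.robotparser module can read a site’s robots.txt file for you. Here’s a minimal sketch; the example.com URLs are placeholders, not real targets:

from urllib.robotparser import RobotFileParser

# Point the parser at the site's robots.txt file (placeholder domain)
parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()

# can_fetch() reports whether the given user agent may crawl the URL
print(parser.can_fetch("*", "https://www.example.com/some-page"))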
What’s the difference between a web crawler and a web scraper?
A web crawler just collects data (usually to archive or index), while web scrapers look for specific types of data to collect, analyze, and transform.
Web scraping has a wide variety of applications. Web developers, digital marketers, data scientists, and journalists regularly use web scraping to collect publicly available data. This data can be transferred to a spreadsheet or JSON file for easy data analysis, or it can be used to create an application programming interface (API). Web scraping is also great for building bots, automating complicated searches, and tracking the prices of goods and services.
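For instance, data that’s been pulled into a Python list can be handed off as a JSON file with the built-in json module. This is just a sketch with made-up numbers and a placeholder file name:

import json

# Hypothetical scraped prices; in practice this list would come from your scraper
prices = [128.00, 98.50, 88.00]

# Write the data to a JSON file for later analysis
with open("prices.json", "w") as f:
    json.dump(prices, f)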
Other common real-world applications of web scraping include price comparison, market research, news monitoring, and lead generation.
Web scraping involves three steps:
1. Load the target URL and download the page’s HTML
2. Parse the HTML
3. Extract the relevant data into a usable format, such as a list, spreadsheet, or JSON file
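Here’s what those three steps look like in miniature using the requests library and Beautiful Soup; the URL and tag are placeholders, and the tutorial below swaps requests out for Selenium so that pages rendered with JavaScript load correctly:

import requests
from bs4 import BeautifulSoup

# Step 1: fetch the page (placeholder URL)
html = requests.get("https://www.example.com").text

# Step 2: parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the data you care about
headings = [tag.text.strip() for tag in soup.find_all("h1")]
print(headings)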
There are many popular tools and libraries used to scrape the web with Python.
However, for the purposes of this tutorial, we’ll be focusing on just three: Beautiful Soup 4 (BS4), Selenium, and the statistics.py module.
For this tutorial, we’ll build a web scraper that compares the average prices of products offered by two similar online fashion retailers. Specifically, let’s say we want to compare the prices of women’s jeans on Madewell and NET-A-PORTER to see who has the better price.
For both Madewell and NET-A-PORTER, you’ll want to grab the target URL from their webpage for women’s jeans.
For Madewell, this URL is:
https://www.madewell.com/womens/clothing/jeans
For NET-A-PORTER, your URL will be:
https://www.net-a-porter.com/en-us/shop/clothing/jeans
Once you’ve selected your URLs, you’ll want to figure out what HTML tags or attributes your desired data will be located under. For this step, you’ll want to inspect the source of your webpage (or open the Developer Tools Panel).
You can do this by right-clicking on the page you’re on and selecting Inspect from the drop-down menu.
Google Chrome shortcut: Ctrl + Shift + C on Windows or Command + Shift + C on macOS will let you view the HTML code for this step.
In this case, we’re looking for the price of jeans. If you look through the HTML document, you’ll notice that this information is available under the <span> tag for both Madewell and NET-A-PORTER. However, using the <span> tag alone would retrieve too much irrelevant data because it’s too generic. We want to narrow down our target when data scraping, and we can get more specific by using attributes inside the <span> tag instead.
For Madewell, a better target is the <span> tags whose class attribute is product-sales-price product-usd.
For NET-A-PORTER, we’d want to narrow down our target with the itemprop="price" attribute.
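To see how targeting by attribute narrows things down, here’s a small sketch that runs both selectors against a made-up snippet of HTML (the markup below is invented for illustration, not copied from either site):

from bs4 import BeautifulSoup

# Invented HTML resembling the two retailers' price markup
html = """
<span class="product-sales-price product-usd">$128.00</span>
<span itemprop="price">$345.00</span>
<span>Free shipping</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Madewell-style: match <span> tags by their class
print(soup.find_all("span", "product-sales-price product-usd"))

# NET-A-PORTER-style: match <span> tags by their itemprop attribute
print(soup.find_all("span", {"itemprop": "price"}))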
For this task, we will be using the Selenium and Beautiful Soup 4 (BS4) libraries in addition to the statistics.py module. Here’s a quick breakdown of why we chose these web scraping tools:
Selenium can automatically open a web browser and run tasks in it using a simple script. The Selenium library requires a driver for the web browser it controls, so we decided to use Google Chrome and downloaded its driver from here: ChromeDriver Downloads
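Note that this tutorial uses the older Selenium 3-style constructor, where the driver path is passed straight to webdriver.Chrome(). If you’re on Selenium 4, the path goes through a Service object instead; a rough equivalent (with the same placeholder path) looks like this:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4-style setup: wrap the ChromeDriver path in a Service object
service = Service(r'/usercode/chromedriver')
driver = webdriver.Chrome(service=service)
driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")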
We’re using BS4 with Python’s built-in HTML parser because it’s simple and beginner-friendly. A BS4 object gives us access to tools that can scrape any given website through its tags and attributes.
Scrapy is another Python library that would have been suitable for this task, but it’s a little more complex than BS4.
The statistics.py module contains methods for calculating mathematical statistics of numeric data.
| Method | Description |
| --- | --- |
| statistics.mean() | Calculates the mean (average) of the given data |
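For example, with a few made-up prices, statistics.mean() works like this:

import statistics

# Hypothetical list of extracted prices
prices = [88.00, 98.50, 128.00]
print(statistics.mean(prices))  # 104.83333333333333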
First, you’ll want to import statistics, requests, webdriver from selenium, and BeautifulSoup from the bs4 library.
from bs4 import BeautifulSoup
from selenium import webdriver
import requests
import statistics
Next, point Selenium at your ChromeDriver executable and load the NET-A-PORTER jeans page:
# Path to the ChromeDriver executable downloaded earlier
PATH = r'/usercode/chromedriver'
driver = webdriver.Chrome(PATH)
# Load the page in the automated browser so its full HTML can be captured
driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")
Here, we create a BeautifulSoup object with the HTML source, driver.page_source, and Python’s built-in HTML parser, html.parser, as arguments. This gives the scraper a parsed document that it can search for specific tags and attributes.
soup = BeautifulSoup(driver.page_source, 'html.parser')
response = soup.find_all("span", {"itemprop": "price"})
data = []
for item in response:
    data.append(float(item.text.strip("\n$")))
print(data)
Once you’ve collected the prices from both retailers (extracted_data1 and extracted_data2 in the full script below), you can compare their averages with statistics.mean():
print(statistics.mean(extracted_data1))
print(statistics.mean(extracted_data2))
Putting it all together, here’s the complete script, with the scraping steps for each retailer wrapped in its own function:
from bs4 import BeautifulSoup
from selenium import webdriver
import statistics

def shop1():
    # NET-A-PORTER: prices sit in <span> tags with the attribute itemprop="price"
    PATH = r'/usercode/chromedriver'
    driver = webdriver.Chrome(PATH)
    driver.get("https://www.net-a-porter.com/en-us/shop/clothing/jeans")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    response = soup.find_all("span", {"itemprop": "price"})
    data = []
    for item in response:
        data.append(float(item.text.strip("\n$")))
    driver.quit()  # close the browser once the prices are collected
    print(data)
    return data

def shop2():
    # Madewell: prices sit in <span> tags with the class "product-sales-price product-usd"
    PATH = r'/usercode/chromedriver'
    driver = webdriver.Chrome(PATH)
    driver.get("https://www.madewell.com/womens/clothing/jeans")
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    response = soup.find_all("span", "product-sales-price product-usd")
    data = []
    for item in response:
        data.append(float(item.text.strip("\n$")))
    driver.quit()  # close the browser once the prices are collected
    print(data)
    return data

extracted_data1 = shop1()
extracted_data2 = shop2()

print(statistics.mean(extracted_data1))
print(statistics.mean(extracted_data2))
Using the above code, you can repeat the steps for Madewell. As a quick reminder, here are the basic steps you’ll need to follow:
1. Grab your target URL and load it with the driver.get method of the driver object
2. Create a BeautifulSoup object with the page’s HTML source, driver.page_source, and Python’s built-in HTML parser, html.parser, as arguments
3. Use the find_all method to extract the data in the tags and attributes you identified into a list
4. Clean up each entry with .text and .strip()
Congratulations! You’ve built your first web scraper with Python. By now, you might have a better idea of just how useful web scraping can be, and we encourage you to keep learning more about Python if you want to develop the skills to create your own APIs.
You might not master Python in a single day, but hopefully, this tutorial has helped you realize that Python is much more approachable than you might expect.
To help you master Python, we’ve created the Predictive Data Analysis with Python course.
Happy learning!