Extracting Data with Web Scraping
Learn how to extract data from websites across the internet.
Introduction to web scraping
Web scraping is a method for extracting data from web pages. With web scraping, we can fetch content in HTML, XML, or JSON format from web pages, parse it, and pull out the relevant data. We can also create scripts that retrieve and parse web pages automatically on a schedule, for example to collect comments from a forum or a social media platform, or the latest product prices from Amazon.
Web scraping can also be used as a one-time process to extract relevant data. It has a wide range of applications, including data mining, data analysis, online market research, and more. It’s a useful tool for extracting data from websites that do not provide an API.
Python is a great tool for web scraping. Two popular Python libraries for the job are requests and Beautiful Soup.
The requests library
The requests library lets us send HTTP requests to websites and handle the responses. The most common type of request for our purpose is a GET request, which retrieves information from a server or a service. If the request is successful, the server returns a response containing the data we requested, usually in HTML or JSON format.
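As a minimal sketch of the idea above, the snippet below wraps a GET request in a small helper. The function name fetch_page, the URL https://example.com/products, and the page parameter are illustrative assumptions, not part of the lesson. To keep the example runnable offline, it only builds the request and prints what would be sent instead of contacting a server.

```python
import requests


def fetch_page(url, params=None, timeout=10):
    """Send a GET request and return the response (hypothetical helper)."""
    response = requests.get(url, params=params, timeout=timeout)
    # Raise requests.HTTPError if the server answered with a 4xx or 5xx code.
    response.raise_for_status()
    return response


# Build (but do not send) a GET request to inspect its method and final URL.
# The URL and the "page" parameter are placeholders for illustration only.
prepared = requests.Request(
    "GET", "https://example.com/products", params={"page": 2}
).prepare()
print(prepared.method, prepared.url)
```

Calling fetch_page(url).text on a real site would return the page body as a string, and fetch_page(url).json() would decode a JSON response.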
The Beautiful Soup library
After sending a successful request and fetching the response using the requests library, we can use the Beautiful Soup library to parse and navigate the returned content.
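The parsing step can be sketched as follows. The HTML string stands in for the text of a response, and its tag names and class names (product, name, price) are made up for illustration. The sketch assumes the beautifulsoup4 package is installed and uses Python's built-in html.parser.

```python
from bs4 import BeautifulSoup

# Stand-in for response.text; the markup below is invented for this example.
html = """
<html><head><title>Example Store</title></head>
<body>
  <h1>Products</h1>
  <ul>
    <li class="product"><span class="name">Keyboard</span> <span class="price">$25</span></li>
    <li class="product"><span class="name">Mouse</span> <span class="price">$10</span></li>
  </ul>
</body></html>
"""

# Parse the document with the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

# Navigate directly to the <title> tag and read its text.
title = soup.title.string

# Find every product item and pull out its name and price.
products = [
    (
        item.find("span", class_="name").get_text(),
        item.find("span", class_="price").get_text(),
    )
    for item in soup.find_all("li", class_="product")
]

print(title)
print(products)
```

In a real scraper, the html variable would typically be replaced by the text of a response fetched with requests.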
Response codes
After we send a GET request to a server, the server returns a response code (also called a status code) that communicates the status of the HTTP request. Response codes are grouped into classes, and each class covers a range of codes.
Response codes between 200 and 299 indicate success, codes between 300 and 399 indicate redirection, codes between 400 and 499 indicate a client-side error, and codes between 500 and 599 indicate a server-side error.
Some common responses are:
200 OK: The request was successful, and the server has returned the requested response.
302 Found: The requested resource has been temporarily moved to a new location.
400 Bad Request: The request was invalid or malformed.
404 Not Found: The requested resource could not be found on the server.
500 Internal Server Error: An unexpected error occurred on the server while processing the request.
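The ranges and common codes above can be checked in code. The sketch below uses only Python's standard library. The helper names status_class and describe are hypothetical, and any code outside the four ranges listed above is simply labeled "other".

```python
from http import HTTPStatus


def status_class(code):
    """Map a numeric HTTP status code to the class it falls into."""
    if 200 <= code <= 299:
        return "success"
    if 300 <= code <= 399:
        return "redirection"
    if 400 <= code <= 499:
        return "client-side error"
    if 500 <= code <= 599:
        return "server-side error"
    return "other"


def describe(code):
    """Return the code, its standard reason phrase, and its class."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the statuses known to the standard library.
        phrase = "Unknown"
    return f"{code} {phrase}: {status_class(code)}"


for code in (200, 302, 400, 404, 500):
    print(describe(code))
```

With the requests library, the numeric code of a response is available as response.status_code, so it can be passed straight to a helper like status_class.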
Tutorial
Let's go over a quick tutorial on web scraping using Python. We’ll scrape the following example website: