
What Is Web Scraping and How We Can Use It

Awais Qasim
Jan 22, 2024
7 min read


We often come across a website containing data of interest to us. However, there may be so much data that extracting it manually would be tedious and error-prone. This is where web scraping comes in. Web scraping, sometimes called web harvesting, refers to the automatic extraction of data from websites. To perform it, we use a language or tool that extracts data from web pages in a structured way. We can then analyze this data as per our needs.

Web scraping

How Does It Work?


Usually, we send one or more HTTP requests to the website we are interested in and receive the HTML content of its pages. This content is then parsed, discarding irrelevant material and keeping only the data we want. Note that the data can be text or visuals (images/videos). The process can be semi-automated, where we copy the data from the website ourselves, or fully automated, where we use tools and configure the data extraction.
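As a minimal sketch of the first half of this process, the snippet below fetches a page's HTML with the requests library (here, https://example.com stands in as a placeholder target):

```python
import requests

# Send an HTTP GET request to the target page.
response = requests.get("https://example.com")

# The raw HTML of the page arrives in response.text;
# a parser would then filter it down to the data we need.
print(response.status_code)
print(response.text[:80])
```

A parser such as Beautiful Soup (used later in this blog) would take `response.text` as its input.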

Issues in Web Scraping

If a website has not enforced an automated bot-blocking mechanism such as CAPTCHAs, it is easy to copy its content using automated tools. The outcome also depends on the specific kind of CAPTCHA the website implements, ranging from text-entry and image-based CAPTCHAs to audio, puzzle, button, and even invisible ones. Nevertheless, several services, such as 2Captcha (https://2captcha.com/) and Anti-Captcha (https://anti-captcha.com/), offer to decode these CAPTCHAs on our behalf, usually for a fee. Alternatively, if we aim to avoid these charges, machine learning methods can be employed to tackle text- and image-based CAPTCHAs.

The Legality of Web Scraping

In general, scraping a website is not illegal. However, challenges emerge when we retrieve information from a website that was not intended for public exposure. As a general guideline, data present on a website without the need for login credentials can typically be extracted through scraping without encountering significant problems. Similarly, if a website has deployed software that restricts the use of web scrapers, then we should avoid it.

How Do Web Scrapers Work?

A multitude of diverse web scrapers are available, each equipped with its distinct array of functions. Here is a broad outline of how a typical web scraper functions:

  1. HTTP requests: The web scraper commences by sending an HTTP request to a designated URL, with the objective of retrieving the web page’s content. This procedure mirrors the way a web browser fetches a web page.

  2. Acquiring HTML: The server hosting the website responds to the request by transmitting the HTML content of the web page. This HTML code encompasses all components like text, images, links, and other elements constituting the web page.

  3. HTML parsing: Subsequently, the web scraper engages in HTML parsing, a process of analyzing and interpreting the HTML content to locate sections of the web page containing the desired data. This entails utilizing tools like HTML parsing libraries to navigate the structural aspects of the HTML code.

  4. Data extraction: Once the pertinent segments of the HTML are pinpointed, the scraper proceeds to extract the targeted data. This might involve a range of content categories, including text, images, links, tables, or any other relevant information found on the web page.

  5. Data cleansing: Depending on the quality of the HTML code and the page’s structure, the extracted data might necessitate cleaning and formatting. This phase involves eliminating extraneous tags and special characters, ensuring that the data is formatted in a usable manner.

  6. Data storage: After the cleansing phase, the cleaned data can be organized into a structured format. This could involve storing the data in mediums like CSV files, databases, or other storage solutions aligning with the intended purpose.

  7. Iterating through pages: In cases where the scraper needs to accumulate data from multiple pages (such as scraping search results), it iterates through the process by sending requests to distinct URLs, extracting data from each individual page.

  8. Handling dynamic content: Websites employing JavaScript to load content dynamically subsequent to the initial HTML retrieval necessitate more sophisticated scraping techniques. This involves utilizing tools like a headless browser or resources like Selenium to interact with the page as a user would, thereby extracting dynamically loaded content.

  9. Observing robots.txt: The web scraper must adhere to the instructions outlined in a website’s robots.txt file, which delineates the permissible and restricted sections for scraping. Adhering to these directives is pivotal in avoiding legal and ethical dilemmas.

  10. Rate limiting: To avert overwhelming a website’s server with an excessive number of requests in a short span, the scraper might integrate rate-limiting mechanisms. These mechanisms are designed to ensure responsible and restrained scraping.

Common steps of web scraping
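The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production scraper: the URL is a placeholder, the output filename links.csv is an arbitrary choice, and it assumes the requests and beautifulsoup4 packages are installed.

```python
import csv
import time
import requests
from bs4 import BeautifulSoup

URL = "https://example.com"  # placeholder target

# Steps 1-2: send the HTTP request and receive the HTML.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# Step 3: parse the HTML.
soup = BeautifulSoup(response.text, "html.parser")

# Steps 4-5: extract and clean the data (here, the text and target of every link).
rows = []
for link in soup.find_all("a"):
    text = link.get_text(strip=True)  # cleaning: strip surrounding whitespace
    href = link.get("href", "")
    if text:
        rows.append((text, href))

# Step 6: store the data in a structured format (CSV).
with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text", "href"])
    writer.writerows(rows)

# Step 10: rate limiting -- pause before sending any further requests.
time.sleep(1)
```

Iterating through pages (step 7) would wrap the request-and-extract portion in a loop over a list of URLs, sleeping between iterations.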

It’s important to understand that web scraping must be carried out conscientiously and ethically. Prior to initiating scraping activities on a website, it is advisable to carefully review the website’s terms of use. This practice ensures compliance with scraping regulations and provides insights into any constraints or recommendations stipulated by the website’s administrators.
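A scraper can also check a site's robots.txt programmatically before fetching anything. Here is a small sketch using Python's standard-library urllib.robotparser; the Educative URLs and the wildcard user agent are illustrative choices:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.educative.io/robots.txt")
rp.read()  # fetch and parse the site's robots.txt file

# can_fetch() reports whether the given user agent may scrape the given URL.
allowed = rp.can_fetch("*", "https://www.educative.io/blog")
print("Allowed to scrape:", allowed)
```

If `can_fetch()` returns False, a well-behaved scraper should skip that URL.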

How to Scrape a Website Using Python

Let’s now learn how we can use Python to scrape a website. For this, we will use one of the Educative blog posts as an example.

Many modern websites feature intricate HTML structures. Thankfully, the majority of web browsers offer tools that help us decipher these complexities in website elements. For example, when we open the blog through Chrome, we can right-click any of the blog titles. Then, we can opt for the “Inspect” choice from the menu (illustrated below):

Right-clicking on the blog post that we want to scrape

After clicking “Inspect,” we will see a sidebar showing the HTML tag that contains that text.

Finding out the tag of the text we want to parse

A variety of web scrapers are available in Python and other programming languages. For this blog, we’ll use the widely used library Beautiful Soup. We can set it up by executing the command below:

pip3 install beautifulsoup4

Retrieving H1 Headings From a Website

Let’s write code for retrieving all H1 headings from our blog.

import requests
from bs4 import BeautifulSoup

url = 'https://www.educative.io/blog/get-started-with-python-debuggers'

def get_data():

    req = requests.get(url)
    html = req.text
    soup = BeautifulSoup(html, 'html.parser')
    data_stream = soup.findAll('h1')

    for data_chunk in data_stream:
        print(data_chunk)
        print("\n")

    return data_stream

if __name__ == '__main__':
    data = get_data()
Retrieving all the h1 headings

If we execute the code above, we will see the following response.

Printing the h1 headings received from the blog

Let’s now review the code we have written.

Lines 1–2: We import the requests library and the scraper we will be using, i.e., BeautifulSoup.

Line 4: We specify the URL of the blog that we will use for scraping.

Line 6: We define the get_data() method.

Line 8: We make a req object using the blog URL.

Lines 9–10: We extract the HTML text of the response and specify the HTML parser, which in our case is html.parser. It is included with Python. Note that we can use any other parser too.

Line 11: We specify the tag that we want to receive from the website, i.e., h1.

Lines 13–15: We print all the data we receive from the website for the mentioned tag.

Line 17: We return the received data.

Now, let’s change the code to retrieve all the h2 headings. In the code widget above, in line 11, let’s replace the tag h1 with h2, as shown below.

data_stream = soup.findAll('h2')
Retrieving all the h2 headings

Now, if we execute the code above, we will see all the h2 headings being printed.

Printing the h2 headings received from the blog

Finally, let’s write code to retrieve all the paragraphs in the blog. If we use the “Inspect” option as mentioned above, we will see that each paragraph is wrapped in a <p> tag. This time, in the code widget above, in line 11, we will use the tag p, as shown below.

data_stream = soup.findAll('p')
Retrieving all the paragraphs

After executing the code above, we will see all the paragraphs of the blog.

Printing the paragraphs received from the blog
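The code prints whole tags, such as <p>…</p>. If we only want the text inside the tags, Beautiful Soup's get_text() method strips the markup. Here is a small sketch reusing the same blog URL; limiting the preview to the first few paragraphs is just a presentation choice:

```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.educative.io/blog/get-started-with-python-debuggers'

req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

# Collect the plain text of every paragraph, skipping empty ones.
paragraphs = [p.get_text(strip=True) for p in soup.find_all('p')]
paragraphs = [text for text in paragraphs if text]

for text in paragraphs[:3]:  # preview the first few paragraphs
    print(text)
```

Passing strip=True to get_text() removes the leading and trailing whitespace that often surrounds text in HTML.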

This wraps up our blog about web scraping and how to use it. We started with a description of web scraping and how it can be beneficial. Then, we discussed the legal issues that might arise and how web scrapers work in general. After that, we implemented a working web scraper in Python. Note that many off-the-shelf tools are available too, but having an in-depth knowledge of how web scrapers work is always helpful.

Your Next Learning Steps

To deepen your understanding of what web scraping is, we strongly recommend you look through this selection of specialized courses on the Educative platform.

Don't pass up this chance to increase your knowledge and expertise in web scraping. Take the first step toward becoming a web scraping expert by immediately enrolling in Educative courses!

Frequently Asked Questions

What is web scraping used for?

Web scraping is primarily used for applications such as monitoring prices, gathering price intelligence, tracking news, generating leads, and conducting market research. Generally, it’s used by individuals and businesses seeking publicly accessible web data to derive insightful information and inform more strategic decision-making.
