Web scraping with Beautiful Soup

One of the most popular tools for web scraping is Beautiful Soup, a Python library that provides an intuitive way to parse HTML and XML documents.

The following steps demonstrate the process of web scraping with Beautiful Soup.

Import libraries

First, we import the necessary libraries for web scraping.

from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

Create a Beautiful Soup object

web = "https://en.wikipedia.org/wiki/The_World%27s_Billionaires"
html = urlopen(web)
soup_obj = BeautifulSoup(html, features="html.parser")

In this Answer, we use a Wikipedia page to scrape some data. We use urlopen to open the webpage and create a BeautifulSoup object to parse the HTML content.
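As a quick offline illustration (using a made-up HTML snippet rather than the live Wikipedia page), a BeautifulSoup object exposes parsed tags both as attributes and through lookup methods:

```python
from bs4 import BeautifulSoup

# A tiny, self-made HTML snippet (not the Wikipedia page) to show
# how a parsed document can be queried.
html_doc = "<html><body><h1>Billionaires</h1><p>2023 list</p></body></html>"
soup = BeautifulSoup(html_doc, features="html.parser")

print(soup.h1.text)          # first <h1> tag, accessed as an attribute
print(soup.find("p").text)   # same idea via the find() method
```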

Retrieve the desired content

table = soup_obj.find("table", class_="wikitable sortable")
extract = table.find("tbody")

We use the find() method to locate the table containing the desired data within the HTML structure.
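To see how the class filter narrows the search, here is a sketch on hypothetical markup containing two tables, only one of which carries the target class (find() returns the first match):

```python
from bs4 import BeautifulSoup

# Made-up markup: two tables, only one with class "wikitable sortable".
html_doc = """
<table class="infobox"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable"><tbody><tr><td>kept</td></tr></tbody></table>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

# find() returns the first <table> whose class attribute matches.
table = soup.find("table", class_="wikitable sortable")
print(table.find("td").text)  # prints "kept"
```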

Scrape data from rows

count = 0
data = list()
rows = extract.find_all("tr")
for row in rows:
    if count == 0:
        column = row.find_all("th")
        count = count + 1
    else:
        column = row.find_all("td")
    column = [element.text.strip() for element in column]
    data.append([element for element in column if element])

Using a loop, we iterate through each row of the table. In the first iteration, we extract the header cells (th tags); in every subsequent iteration, we extract the data cells (td tags). Each cell's text is stripped of surrounding whitespace, and empty values are filtered out before the row is appended to data.
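The same row-walking logic can be exercised on a small, made-up table, so the sketch runs without network access:

```python
from bs4 import BeautifulSoup

# A miniature stand-in for the scraped table: one header row of <th>
# cells followed by data rows of <td> cells.
html_doc = """
<table>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Ada</td></tr>
  <tr><td>2</td><td>Grace</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

data = []
for count, row in enumerate(soup.find_all("tr")):
    # The header row holds <th> cells; every later row holds <td> cells.
    cells = row.find_all("th") if count == 0 else row.find_all("td")
    text = [cell.text.strip() for cell in cells]
    data.append([value for value in text if value])

print(data)  # [['Rank', 'Name'], ['1', 'Ada'], ['2', 'Grace']]
```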

Create a DataFrame

frame = pd.DataFrame(columns=data[0])
for i in range(1, len(data)):
    frame = pd.concat([frame, pd.DataFrame([data[i]], columns=data[0])], ignore_index=True)

Using the scraped data, we create a pandas DataFrame. The first row of data holds the column names, which we pass as the columns argument when constructing the DataFrame. We then iterate through the remaining rows, appending each one with pd.concat (DataFrame.append was removed in pandas 2.0).
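With toy values standing in for the scraped rows, the same construction can be checked in isolation:

```python
import pandas as pd

# Stand-in for the scraped rows: header first, then data rows.
data = [["Rank", "Name"], ["1", "Ada"], ["2", "Grace"]]

frame = pd.DataFrame(columns=data[0])
for i in range(1, len(data)):
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
    row = pd.DataFrame([data[i]], columns=data[0])
    frame = pd.concat([frame, row], ignore_index=True)

print(frame)
```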

Display result

frame = frame.drop("Primary source(s) of wealth", axis=1)
print("The constructed DataFrame is \n")
print(frame.to_string())

Finally, we remove the "Primary source(s) of wealth" column with the drop() method, specifying axis=1 to drop a column rather than a row, and print the resulting DataFrame.
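The effect of axis=1 is easy to verify on a toy frame:

```python
import pandas as pd

# Toy frame with a column we want to discard.
frame = pd.DataFrame({
    "Name": ["Ada", "Grace"],
    "Primary source(s) of wealth": ["Software", "Hardware"],
})

# axis=1 tells drop() to remove a column; axis=0 would remove a row label.
frame = frame.drop("Primary source(s) of wealth", axis=1)
print(list(frame.columns))  # ['Name']
```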

Code

The following Python code scrapes the data and stores it in a DataFrame.

from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
web = "https://en.wikipedia.org/wiki/The_World%27s_Billionaires"
html = urlopen(web)
soup_obj = BeautifulSoup(html, features="html.parser")
table = soup_obj.find("table", class_="wikitable sortable")
extract = table.find("tbody")
count = 0
data = list()
rows = extract.find_all("tr")
for row in rows:
    if count == 0:
        column = row.find_all("th")
        count = count + 1
    else:
        column = row.find_all("td")
    column = [element.text.strip() for element in column]
    data.append([element for element in column if element])
frame = pd.DataFrame(columns=data[0])
for i in range(1, len(data)):
    frame = pd.concat([frame, pd.DataFrame([data[i]], columns=data[0])], ignore_index=True)
frame = frame.drop("Primary source(s) of wealth", axis=1)
print("The constructed DataFrame is \n")
print(frame.to_string())

By combining web scraping with Beautiful Soup and the data manipulation capabilities of pandas, we have extracted "The World's Billionaires" data from the corresponding Wikipedia page. The resulting DataFrame provides a structured representation of the scraped data, enabling further analysis and visualization.

Copyright ©2025 Educative, Inc. All rights reserved