One of the most popular tools for web scraping is Beautiful Soup, a Python library that provides an intuitive way to parse HTML and XML documents.
The following steps demonstrate the process of web scraping with Beautiful Soup.
First, we import the necessary libraries for web scraping.
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup
web = "https://en.wikipedia.org/wiki/The_World%27s_Billionaires"
html = urlopen(web)
soup_obj = BeautifulSoup(html, features="html.parser")
We use urlopen to open the webpage and create a BeautifulSoup object to parse the HTML content. In this Answer, we scrape data from the Wikipedia page listing the world's billionaires.
table = soup_obj.find("table", class_="wikitable sortable")
extract = table.find("tbody")
We use the find() method to locate the table containing the desired data within the HTML structure.
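The behavior of find() can be seen on a small inline snippet. This is a minimal sketch, assuming a hypothetical HTML fragment that mimics the structure of the Wikipedia table (the snippet itself is invented for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the real page.
html_doc = """
<table class="wikitable sortable">
  <tbody><tr><th>Rank</th><th>Name</th></tr></tbody>
</table>
"""

soup = BeautifulSoup(html_doc, features="html.parser")

# find() returns the first matching tag; the class_ keyword
# (note the trailing underscore) filters on the CSS class attribute.
table = soup.find("table", class_="wikitable sortable")
body = table.find("tbody")
print(body.find("th").text)  # → Rank
```

Because find() returns None when nothing matches, checking the result before chaining further calls is a useful safeguard on real pages.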
count = 0
data = list()
rows = extract.find_all("tr")
for row in rows:
    if count == 0:
        column = row.find_all("th")
        count = count + 1
    else:
        column = row.find_all("td")
    column = [element.text.strip() for element in column]
    data.append([element for element in column if element])
Using a loop, we iterate through each row of the table. In the first iteration, we extract the column names (th tags) and store them in the column variable. For subsequent iterations, we extract the cell values (td tags) and store them in column, stripping surrounding whitespace and discarding empty cells.
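The row-extraction logic can be checked on a miniature table. This sketch assumes a hypothetical two-row snippet with placeholder values rather than real scraped data:

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the table structure: one header row,
# one data row, with placeholder (not real) values.
snippet = """
<table><tbody>
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td> Alice </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(snippet, features="html.parser")
data = []
for i, row in enumerate(soup.find_all("tr")):
    # The header row holds th tags; data rows hold td tags.
    cells = row.find_all("th") if i == 0 else row.find_all("td")
    # strip() removes the surrounding whitespace from each cell.
    data.append([cell.text.strip() for cell in cells])

print(data)  # → [['Rank', 'Name'], ['1', 'Alice']]
```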
frame = pd.DataFrame(columns=data[0])
for i in range(1, len(data)):
    row = pd.DataFrame([data[i]], columns=data[0])
    frame = pd.concat([frame, row], ignore_index=True)
Using the scraped data, we create a pandas DataFrame. The first row of data contains the column names, which we pass as the columns argument while constructing the DataFrame. We then iterate through the remaining rows, wrapping each one in a one-row DataFrame and appending it with pd.concat() (the older DataFrame.append() method was removed in pandas 2.0).
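The same construction can be reproduced on a small hand-made list; this sketch uses placeholder values, not scraped ones. Passing the remaining rows straight to the DataFrame constructor is an equivalent, and typically faster, alternative to appending row by row:

```python
import pandas as pd

# Hypothetical scraped data: first entry is the header, rest are rows.
data = [["Rank", "Name"], ["1", "Alice"], ["2", "Bob"]]

# Build the frame in one call instead of concatenating in a loop.
frame = pd.DataFrame(data[1:], columns=data[0])
print(frame.shape)  # → (2, 2)
```

Row-by-row concatenation copies the frame on every iteration, so the single-call constructor scales much better for large tables.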
frame = frame.drop("Primary source(s) of wealth", axis=1)
print("The constructed DataFrame is \n")
print(frame.to_string())
Finally, we print the constructed DataFrame containing the scraped data after removing the "Primary source(s) of wealth" column, using the drop() method and specifying axis=1 to drop a column.
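The effect of axis=1 can be demonstrated on a tiny frame; the column names and values below are placeholders invented for illustration:

```python
import pandas as pd

# Hypothetical frame with a column we want to discard.
frame = pd.DataFrame(
    {"Name": ["Alice"], "Net worth": ["$1 B"], "Source": ["tech"]}
)

# axis=1 tells drop() to remove a column rather than a row;
# axis=0 (the default) would look for a matching row label instead.
frame = frame.drop("Source", axis=1)
print(list(frame.columns))  # → ['Name', 'Net worth']
```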
The complete Python code that scrapes the data and stores it in a DataFrame is shown below.
from urllib.request import urlopen
import pandas as pd
from bs4 import BeautifulSoup

web = "https://en.wikipedia.org/wiki/The_World%27s_Billionaires"
html = urlopen(web)
soup_obj = BeautifulSoup(html, features="html.parser")

table = soup_obj.find("table", class_="wikitable sortable")
extract = table.find("tbody")

count = 0
data = list()
rows = extract.find_all("tr")
for row in rows:
    if count == 0:
        column = row.find_all("th")
        count = count + 1
    else:
        column = row.find_all("td")
    column = [element.text.strip() for element in column]
    data.append([element for element in column if element])

frame = pd.DataFrame(columns=data[0])
for i in range(1, len(data)):
    row = pd.DataFrame([data[i]], columns=data[0])
    frame = pd.concat([frame, row], ignore_index=True)

frame = frame.drop("Primary source(s) of wealth", axis=1)
print("The constructed DataFrame is \n")
print(frame.to_string())
By combining web scraping with Beautiful Soup and the data manipulation capabilities of pandas, we have extracted "The World's Billionaires" data from the corresponding Wikipedia page. The resulting DataFrame provides a structured representation of the scraped data, enabling further analysis and visualization.