
Narrowing in on the Data

Understand how to filter and extract meaningful data from HTML tables during web scraping in Python. This lesson helps you identify relevant table rows, handle blank cells and unusual characters, and organize the scraped data into structured lists for further analysis.


Getting exact rows

You are homing in on the information that you want. Let’s take a look at where you are:

Python 3.8

import requests
from bs4 import BeautifulSoup


def scrape_website(address: str) -> str:
    """
    Scrape the properties website and return the response text
    :param address: URL of website to scrape
    :return: str as response.text
    """
    headers = {'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:74.0) Gecko/20100101 Firefox/74.0"}
    r = requests.get(address, headers=headers)
    return r.text


url = "https://www.engineeringtoolbox.com/properties-aluminum-pipe-d_1340.html"
website_text = scrape_website(url)
soup = BeautifulSoup(website_text, 'lxml')
table = soup.find('table', class_="large tablesorter")

for row in table:
    for index, tr in enumerate(row):
        print(len(tr), tr)
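Those blank entries are not a bug in your loop: when you iterate over a BeautifulSoup Tag, you get all of its direct children, and the whitespace between rows comes back as NavigableString objects alongside the <tr> tags. A minimal sketch of this behavior, using a small hypothetical table in place of the scraped page (and the stdlib html.parser rather than lxml, so it runs anywhere):

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the scraped markup, not the live page.
html = """
<table class="large tablesorter">
  <tr><th>Size</th><th>Weight</th></tr>
  <tr><td>1 in</td><td>0.58</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="large tablesorter")

# Iterating over a Tag yields its direct children: the text between
# rows arrives as NavigableString objects, the rows as Tag objects.
blanks = [child for child in table if isinstance(child, NavigableString)]
rows = [child for child in table if not isinstance(child, NavigableString)]

for b in blanks:
    print("blank:", repr(str(b)))
for r in rows:
    print("row:  ", r.name)
```

Every "blank" here is pure whitespace, which is exactly why checking the length (or the type) of each child lets you filter them out.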

Since the first tag you see is <tr> (which means table row), you can now start homing in on how to filter out the blanks. The next step is to print the length of each tr:

table = soup.find('table', class_="large tablesorter")
for row in table:
    for tr in row:
        print(len(tr), tr)
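Once the lengths reveal which children are real rows, the natural follow-up is to collect each row's cell text into a list of lists, ready for further analysis. A hedged sketch of that step, again using a small hypothetical table rather than the live Engineering ToolBox page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped markup, not the live page.
html = """
<table class="large tablesorter">
  <tr><th>Nominal Size</th><th>Outside Diameter (in)</th></tr>
  <tr><td>1</td><td>1.315</td></tr>
  <tr><td>2</td><td>2.375</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="large tablesorter")

structured = []
for tr in table.find_all("tr"):
    # get_text(strip=True) trims stray whitespace and oddly encoded
    # padding characters from each cell.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:  # skip anything that produced no cells
        structured.append(cells)

print(structured)
```

Using find_all("tr") instead of iterating the table directly sidesteps the NavigableString blanks entirely, since it returns only Tag objects.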

Again, looking at the ...