
Narrowing in on the Data

Understand how to filter and extract meaningful data from HTML tables during web scraping in Python. This lesson helps you identify relevant table rows, handle blank cells and unusual characters, and organize the scraped data into structured lists for further analysis.


Getting exact rows

You are homing in on the information that you want. Let’s take a look at where you are:

Python 3.8

import requests
from bs4 import BeautifulSoup


def scrape_website(address: str) -> str:
    """
    Scrape the properties website and return the response text
    :param address: URL of website to scrape
    :return: str as response.text
    """
    headers = {'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:74.0) Gecko/20100101 Firefox/74.0"}
    r = requests.get(address, headers=headers)
    return r.text


url = "https://www.engineeringtoolbox.com/properties-aluminum-pipe-d_1340.html"
website_text = scrape_website(url)
soup = BeautifulSoup(website_text, 'lxml')
table = soup.find('table', class_="large tablesorter")

for row in table:
    for index, tr in enumerate(row):
        print(len(tr), tr)
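Those blank entries are not a bug in your loop: when you iterate over a BeautifulSoup Tag, you get all of its direct children, and the whitespace between rows comes back as NavigableString objects alongside the <tr> tags. A minimal sketch of this behavior, using a small hypothetical table in place of the scraped page (and the stdlib html.parser rather than lxml, so it runs anywhere):

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the scraped markup, not the live page.
html = """
<table class="large tablesorter">
  <tr><th>Size</th><th>Weight</th></tr>
  <tr><td>1 in</td><td>0.58</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="large tablesorter")

# Iterating over a Tag yields its direct children: the text between
# rows arrives as NavigableString objects, the rows as Tag objects.
blanks = [child for child in table if isinstance(child, NavigableString)]
rows = [child for child in table if not isinstance(child, NavigableString)]

for b in blanks:
    print("blank:", repr(str(b)))
for r in rows:
    print("row:  ", r.name)
```

Every "blank" here is pure whitespace, which is exactly why checking the length (or the type) of each child lets you filter them out.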

Since the first tag you see is <tr> (which means table row), you can now start homing in on how to filter out the blanks. The next step is to print the length of each tr:

table = soup.find('table', class_="large tablesorter")
for row in table:
    for tr in row:
        print(len(tr), tr)
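Once the lengths reveal which children are real rows, the natural follow-up is to collect each row's cell text into a list of lists, ready for further analysis. A hedged sketch of that step, again using a small hypothetical table rather than the live Engineering ToolBox page:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped markup, not the live page.
html = """
<table class="large tablesorter">
  <tr><th>Nominal Size</th><th>Outside Diameter (in)</th></tr>
  <tr><td>1</td><td>1.315</td></tr>
  <tr><td>2</td><td>2.375</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="large tablesorter")

structured = []
for tr in table.find_all("tr"):
    # get_text(strip=True) trims stray whitespace and oddly encoded
    # padding characters from each cell.
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:  # skip anything that produced no cells
        structured.append(cells)

print(structured)
```

Using find_all("tr") instead of iterating the table directly sidesteps the NavigableString blanks entirely, since it returns only Tag objects.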

Again, looking at the ...