Fetching Text from a Website
Develop the web scraping and data saving program.
These are the libraries that we will be using.
import requests
from bs4 import BeautifulSoup
import openpyxl
Getting text
The first function you want to create accepts a website address (URL) as an argument and returns the text of the page (its source code, as a str). This is a good, neutral function that can be included in any other program you write because it is agnostic to the URL.
def scrape_website(address: str) -> str:
    """
    Scrape the properties website and return the response text
    :param address: URL of website to scrape
    :return: str as response.text
    """
    headers = {'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:74.0) Gecko/20100101 Firefox/74.0"}
    r = requests.get(address, headers=headers)
    return r.text
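The headers dictionary in the function above replaces the user-agent that Requests would otherwise send. The following is a minimal sketch for checking that without touching the network: it prepares a request the same way requests.get() does internally, but never sends it. The helper name user_agent_sent and the https://example.com placeholder URL are assumptions made for illustration and are not part of the lesson's program.

```python
import requests

# Same browser user-agent string used in scrape_website above.
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:74.0) "
    "Gecko/20100101 Firefox/74.0"
)


def user_agent_sent(headers=None) -> str:
    """Return the User-Agent header Requests would send (hypothetical helper).

    The request is only prepared, never sent, so no network access is needed.
    """
    session = requests.Session()
    # Placeholder URL; any valid URL works because nothing is transmitted.
    request = requests.Request("GET", "https://example.com", headers=headers)
    # prepare_request merges the session's default headers with our own.
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


# Without custom headers, Requests announces itself as python-requests/<version>.
print(user_agent_sent())

# With the headers from scrape_website, the browser string is sent instead.
print(user_agent_sent({"user-agent": BROWSER_UA}))
```

Header names are matched case-insensitively, which is why the lowercase 'user-agent' key overrides the library default.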
Headers are what your browser sends along with its request to access a webpage. The user-agent header identifies the browser and operating system making the request. Because requests without a browser-like user-agent are very obviously robots, it can be good practice to include your normal user-agent to show that you mean no harm. The easiest way to find your browser’s user-agent is to type “What is my user agent?” into a search engine. Requests will still work if you just get() the URL, but ...