Read Data from HTML Files

Learn how to read data from the HTML file format.

Markup language files

A markup language is a computer language that uses tags to separate document elements, giving the information a clear, sectioned structure. Unlike programming languages, markup languages describe how a document is organized rather than the steps a program should execute. Markup files are plain text, so they are human-readable and can be opened with most text editors. While there are numerous markup languages, we’ll cover two of the most popular ones: HTML and XML.

HTML file format

HTML stands for HyperText Markup Language and is the standard markup language for creating web pages. The elements in an HTML file describe the page’s structure so that the browser can display its contents correctly.

<!DOCTYPE html>
<html>
<head>
<title>Welcome to Educative</title>
</head>
<body>
<h1>Advanced Pandas - Going Beyond the Basics</h1>
<p>You are currently on Chapter 2 of the course</p>
</body>
</html>

Read from HTML files

The read_html() function reads HTML tables by searching for <table> tags and returns their contents as a list of pandas DataFrames. It accepts URLs, and we can use it in the same way for local HTML files.

For example, say we have a local HTML file saved from Wikipedia called continents.html, which contains a table of area and population estimates for the seven continents. By using read_html(), we can load the HTML table data into pandas DataFrames. Because the output for this example is a list with only one element (the continents table), we can directly retrieve the table we want by accessing index 0 of the list as shown below:

import pandas as pd

# Define path to HTML file
html_path = '../usr/local/data/html/continents.html'
# read_html() returns a list of DataFrames; retrieve the first (and only) table
continents_df = pd.read_html(html_path)[0]
# Display table contents as HTML
print(continents_df.to_html())
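
Because read_html() always returns a list, a page with several tables yields several DataFrames. The following is a minimal sketch of that behavior. It assumes an inline HTML string with two made-up tables (the contents and the names html and tables are illustrative and do not come from continents.html) and that an HTML parser such as lxml is installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two small tables (illustrative data only)
html = """
<table>
  <thead><tr><th>Item</th><th>Quantity</th></tr></thead>
  <tbody><tr><td>Apples</td><td>3</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Color</th></tr></thead>
  <tbody>
    <tr><td>Red</td></tr>
    <tr><td>Green</td></tr>
  </tbody>
</table>
"""

# Wrap the string in StringIO so it is treated as a file-like object
tables = pd.read_html(StringIO(html))

# One DataFrame per <table> element, in document order
print(type(tables), len(tables))
for index, table in enumerate(tables):
    print(index, table.shape, list(table.columns))
```

Checking the length of the list and the shape of each DataFrame first is a simple way to decide which index to use before selecting a table.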

Note: We can expect to do some manual data cleanup after using the read_html() function, such as assigning column names and converting column data types.
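
As a rough illustration of that cleanup, the sketch below starts from a hand-built DataFrame that stands in for a freshly parsed table. The data, the column names (region, population, area), and the variable names raw and clean are all hypothetical; a real table will need its own cleanup steps:

```python
import pandas as pd

# Hypothetical stand-in for a parsed table: default integer column labels
# and every value stored as text
raw = pd.DataFrame([
    ["Region A", "1,200", "35.5"],
    ["Region B", "950", "12.0"],
])

# Assign descriptive column names
raw.columns = ["region", "population", "area"]

# Strip thousands separators, then convert the text columns to numeric types
clean = raw.assign(
    population=raw["population"].str.replace(",", "", regex=False).astype(int),
    area=raw["area"].astype(float),
)

print(clean.dtypes)
```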

If we know that a table has specific HTML attributes, we can use the attrs parameter to retrieve it specifically. For example, the contents of a table with class name wikitable can be read with the following code:

# Keep only tables whose class attribute is 'wikitable', then retrieve the first match
df = pd.read_html(html_path, attrs={'class': 'wikitable'})[0]
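
To see the effect of attrs without the Wikipedia file, here is a minimal sketch that assumes an inline HTML string with two made-up tables, only one of which has the class wikitable. The class name navbox, the table contents, and the variable names are illustrative assumptions, and an HTML parser such as lxml is assumed to be installed:

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML: one navigation table and one data table
html = """
<table class="navbox">
  <thead><tr><th>Link</th></tr></thead>
  <tbody><tr><td>Home</td></tr></tbody>
</table>
<table class="wikitable">
  <thead><tr><th>Item</th><th>Quantity</th></tr></thead>
  <tbody><tr><td>Apples</td><td>3</td></tr></tbody>
</table>
"""

# Without attrs, every <table> is returned
all_tables = pd.read_html(StringIO(html))

# With attrs, only tables matching the given attributes are returned
wiki_tables = pd.read_html(StringIO(html), attrs={"class": "wikitable"})

print(len(all_tables), len(wiki_tables))
print(wiki_tables[0])
```

If no table matches the given attributes, read_html() raises a ValueError, so filtering with attrs is best done after confirming the attribute values in the page source.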