Essentials of HTML

Discover HTML concepts and features.

Mastering HTML is a requirement for a career in web development. For our goal of web scraping, however, we only need the basics, and those are what we will cover here.

Introduction

HyperText Markup Language (HTML) is the standard markup language used to create web pages. It describes the structure of a web page using tags. These tags tell the browser what each piece of content is, such as a title, heading, image, or link, so that it can display it correctly.
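To make the idea of tags concrete, here is a minimal sketch that uses Python's built-in html.parser module to report which tag labels each piece of text. The sample markup and the TagLister class are our own illustrations, not part of this lesson's files.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, invented for this illustration
SAMPLE = "<h1>Web Scraping</h1><p>Read the <a href='https://example.com'>docs</a>.</p>"


class TagLister(HTMLParser):
    """Collects (tag, text) pairs to show which tag labels which text."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of tags that are currently open
        self.labelled = []   # (innermost open tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Skip whitespace-only text and text outside any tag
        if data.strip() and self.open_tags:
            self.labelled.append((self.open_tags[-1], data.strip()))


lister = TagLister()
lister.feed(SAMPLE)
lister.close()

for tag, text in lister.labelled:
    print(f"<{tag}>: {text}")
```

Each printed line pairs a piece of text with the tag that wraps it, which is the same information a browser uses to decide how to render that text.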

Markup languages vs. programming languages

Here are some common differences between markup and programming languages:

| Markup Language | Programming Language |
| --- | --- |
| Primarily used for defining and describing content | Primarily used for writing executable programs |
| Does not require compilation or execution | Requires compilation and/or execution to produce output |
| Used to format text, images, and multimedia content | Used to build applications, software, and systems |
| Examples: HTML, XML, Markdown | Examples: Python, Java, C++, JavaScript |

Structure

Each HTML document can be viewed as a tree. We describe its components the way we would describe a family tree: each node is an HTML tag, and a tag can contain other tags as its children.

Figure: HTML document tree
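As a rough illustration of the parent and child relationships, the sketch below builds a small tree from a hypothetical page and prints each tag indented under its parent. It uses only the standard library. The Node and TreeBuilder classes are names we made up for this example, and the code assumes well-nested markup with no void elements such as img or br.

```python
from html.parser import HTMLParser

# Hypothetical page, used only to illustrate the tree shape
SAMPLE = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div>
      <h1>Heading</h1>
      <p>Paragraph</p>
    </div>
  </body>
</html>
"""


class Node:
    """One tag in the tree, with a link to its parent and its children."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tree of Node objects (assumes well-nested tags)."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # A new tag becomes a child of the tag we are currently inside
        node = Node(tag, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Closing a tag moves us back up to its parent
        if self.current.parent is not None:
            self.current = self.current.parent


def render(node, depth=0):
    """Returns one line per node, indented two spaces per tree level."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed(SAMPLE)
builder.close()

tree_lines = render(builder.root)
print("\n".join(tree_lines))
```

In the printed outline, html has two children (head and body), and h1 and p are siblings that share div as their parent.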

The browser’s job is to understand each element’s purpose and display it correctly. This tree structure is called the Document Object Model (DOM). Once we have the DOM, we can search for and retrieve any element (node) we want. We will not do the conversion or write the search algorithms ourselves; Python libraries and tools will do the job for us. All we need is the path to the tag.

main.py
from bs4 import BeautifulSoup

# Read the page content
with open('page.html', 'r') as f:
    html_doc = f.read()

# Parse using BeautifulSoup
tree = BeautifulSoup(html_doc, 'html.parser')

# Print the p element content
print(tree.body.div.p.text)
  • Lines 4–5: We start by reading the HTML document.

  • Lines 7–9: We parse the document and convert it to a tree structure using a Python library called BeautifulSoup, which we will discuss later. Once we have the tree, we can navigate it and reach any element using its path from the root. ...
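Since the contents of page.html are not shown here, the following sketch repeats the same navigation on an inline HTML string so that it can be run on its own. The markup is a hypothetical stand-in, not the actual file. It also walks back up from the p element through its ancestors, using BeautifulSoup's parents attribute, to show that the path works in both directions.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.html (the real file is not shown in this lesson)
html_doc = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <div>
      <h1>Welcome</h1>
      <p>First paragraph</p>
    </div>
  </body>
</html>
"""

tree = BeautifulSoup(html_doc, "html.parser")

# Follow the path body -> div -> p from the top of the tree
paragraph = tree.body.div.p
print(paragraph.text)

# Walk back up: collect the names of all ancestors of the p element
path = [parent.name for parent in paragraph.parents]
print(path)
```

Attribute-style access such as tree.body.div.p returns the first matching tag at each step, so it is a convenient shortcut only when the first match is the one we want.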

Tags