Essentials of HTML
Discover HTML concepts and features.
Mastering HTML is a requirement for a career in web development. Here, however, we will cover only the basics needed for our goal of web scraping.
Introduction
HyperText Markup Language (HTML) is the standard markup language used to create web pages. It describes the structure of a web page using tags. These tags tell the browser how to display each piece of content, such as a title, heading, image, or link.
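To make the idea of tags concrete, here is a minimal sketch that lists every opening tag in a small page. It assumes a made-up sample document (`SAMPLE_HTML`) and uses `html.parser` from Python's standard library; the `TagLister` class is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>My first page</title></head>
  <body>
    <h1>Welcome</h1>
    <p>Read the <a href="https://example.com">docs</a>.</p>
  </body>
</html>
"""


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(SAMPLE_HTML)

# Each tag labels one piece of content: a title, a heading, a link, and so on.
for tag, attrs in lister.tags:
    print(tag, attrs)
```

Each printed name is a tag the browser would interpret, and the `a` tag carries its link target in the `href` attribute.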
Markup languages vs. programming languages
Here are some common differences between markup and programming languages:
| Markup Language | Programming Language |
| --- | --- |
| Primarily used for defining and describing content | Primarily used for writing executable programs |
| Does not require compilation or execution | Requires compilation and/or execution to produce output |
| Used to format text, images, and multimedia content | Used to build applications, software, and systems |
| Examples: HTML, XML, Markdown | Examples: Python, Java, C++, JavaScript |
Structure
Each HTML document can be considered a document tree. We describe the tree's components the same way we would describe a family tree: each node is an HTML tag that can contain other tags as children.
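The parent and child relationship can be seen by printing each tag indented by its depth. This is only a sketch: `SAMPLE_HTML` and the `TreePrinter` class are made up for illustration, and the depth counter assumes every tag in the snippet is explicitly closed.

```python
from html.parser import HTMLParser

# Hypothetical snippet in which every tag has a matching closing tag.
SAMPLE_HTML = "<html><body><div><p>Hello</p><p>World</p></div></body></html>"


class TreePrinter(HTMLParser):
    """Collect one indented line per tag, using nesting depth as indentation."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # A new tag is a child of whatever tag is currently open.
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up to its parent.
        self.depth -= 1


printer = TreePrinter()
printer.feed(SAMPLE_HTML)
print("\n".join(printer.lines))
```

In the printed outline, `html` is the root, `body` is its child, and the two `p` tags are siblings that share `div` as their parent.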
The browser’s job is to understand each element’s purpose and display it correctly. This tree structure is called the Document Object Model (DOM). Once we have the DOM, we can easily search for and retrieve any element (node) we want. We will not do the conversion or write the search algorithms ourselves; Python libraries and tools will do the job for us. All we need is the path to the tag.
from bs4 import BeautifulSoup

# reading the page content
with open('page.html', 'r') as f:
    html_doc = f.read()

# Parse using BeautifulSoup
tree = BeautifulSoup(html_doc, 'html.parser')

# print the p element content
print(tree.body.div.p.text)
Lines 4–5: We start by reading the HTML document.
Lines 7–9: We parse the document and convert it to a tree structure using a Python library called BeautifulSoup, which we will discuss later. Once we have the tree, we can navigate it and reach any element using its path from the root. ...
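The idea of reaching an element by its path from the root can also be tried without any extra files or packages. The sketch below swaps in `xml.etree.ElementTree` from the standard library in place of BeautifulSoup, and it only works because the made-up `SAMPLE_HTML` snippet is well-formed. Real pages are often not well-formed XML, so a forgiving HTML parser is usually a better fit for them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a real page.
SAMPLE_HTML = """<html>
  <body>
    <div>
      <p>First paragraph</p>
      <p>Second paragraph</p>
    </div>
  </body>
</html>"""

# The parsed root is the <html> element itself.
root = ET.fromstring(SAMPLE_HTML)

# Follow the path html -> body -> div -> p; find() returns the first match.
first_p = root.find("body/div/p")
print(first_p.text)

# findall() returns every element that matches the same path.
all_p = [p.text for p in root.findall("body/div/p")]
print(all_p)
```

The path `body/div/p` plays the same role as the attribute chain `body.div.p` in the listing above: each step moves from a parent tag to one of its children.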