Essentials of HTML

Discover HTML concepts and features.

We'll cover the following...

Introduction
Markup languages vs. programming languages
Structure
Tags
Attributes
- Importance of attributes
Conclusion

Mastering HTML is a requirement for a career in web development. For web scraping, however, we only need the basics, which this lesson covers.

Introduction

HyperText Markup Language (HTML) is the standard markup language used to create web pages. It describes the structure of a web page using tags. These tags tell the browser how to display each piece of content, such as a title, heading, image, or link.

Markup languages vs. programming languages

Here are some common differences between markup and programming languages:

Markup Language	Programming Language
Primarily used for defining and describing content	Primarily used for writing executable programs
Is not compiled or executed; a tool such as a browser parses and renders it	Must be compiled or interpreted, then executed, to produce output
Used to format text, images, and multimedia content	Used to build applications, software, and systems
Examples: HTML, XML, Markdown	Examples: Python, Java, C++, JavaScript

Structure

Each HTML document can be considered a document tree. We describe the tree's components the same way we would describe a family tree: each node in the tree is an HTML tag that can contain other tags as its children.

HTML document tree
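
To make the tree idea concrete, here is a minimal sketch that prints the nesting of a small page as an indented outline. It uses only `html.parser` from Python's standard library. The sample markup and the names `SAMPLE_HTML`, `TreeOutliner`, and `outline_tree` are illustrative assumptions, not part of the lesson's files. The sketch also assumes every tag is explicitly closed, so void tags such as `<br>` would throw off the depth count.

```python
from html.parser import HTMLParser

# Hypothetical sample page, used only for illustration.
SAMPLE_HTML = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div>
      <p>Hello</p>
    </div>
  </body>
</html>
"""


class TreeOutliner(HTMLParser):
    """Records each opening tag together with its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        # An opening tag is a new node; its children sit one level deeper.
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # A closing tag returns us to the parent's level.
        self.depth -= 1


def outline_tree(html):
    """Return the (depth, tag) outline of an HTML string."""
    parser = TreeOutliner()
    parser.feed(html)
    parser.close()
    return parser.outline


# Print each tag indented by its depth, like a family tree.
for depth, tag in outline_tree(SAMPLE_HTML):
    print("  " * depth + tag)
```

In the printed outline, `html` is the root, `head` and `body` are its children, and `p` sits three levels down, inside `div`.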

The browser’s job is to understand each element’s purpose and display it correctly. This tree structure is called the Document Object Model (DOM). Once we have the DOM, we can search it and retrieve any element (node) we want. We will not build the tree or write the search algorithms ourselves; Python libraries and tools will do that for us. All we need is the path to the tag.
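
As a rough illustration of reaching a node by its path, the sketch below uses `xml.etree.ElementTree` from the standard library as a stand-in. This only works because the hypothetical sample is well-formed; real web pages are often not, which is why an HTML parsing library is normally used instead. The names `SAMPLE` and `text_at_path` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (assumption for illustration).
SAMPLE = (
    "<html><head><title>Demo</title></head>"
    "<body><div><p>Hello</p></div></body></html>"
)


def text_at_path(markup, path):
    """Return the text of the first node at `path`, or None if it is absent.

    `path` is relative to the root <html> element, e.g. "body/div/p".
    """
    root = ET.fromstring(markup)  # parse the markup into a tree
    node = root.find(path)        # walk the tree along the given path
    return None if node is None else node.text


print(text_at_path(SAMPLE, "body/div/p"))  # path exists
print(text_at_path(SAMPLE, "body/span"))   # path does not exist
```

Returning `None` for a missing path, rather than raising an error, is a design choice that keeps a scraper running when a page's layout differs from what was expected.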

from bs4 import BeautifulSoup
# Read the page content
with open('page.html', 'r', encoding='utf-8') as f:
    html_doc = f.read()
# Parse the HTML into a tree using Python's built-in parser
tree = BeautifulSoup(html_doc, 'html.parser')
# Print the text of the first <p> found inside the first <div> in <body>
print(tree.body.div.p.text)

Lines 2–4: We start by reading the HTML document.
Lines 5–8: We parse the document and convert it to a tree structure using a Python library called BeautifulSoup, which we will discuss later. Once we have the tree, we can navigate it and reach any element using its path from the root. Here, tree.body.div.p.text follows the path body, div, p and prints the paragraph's text. ...
