How to read and find tags of HTML in BeautifulSoup4

BeautifulSoup is a third-party Python library for parsing HTML and XML files and extracting information from them. It's commonly used in web scraping, that is, extracting data and information from websites using bots. The package is published as beautifulsoup4 and imported as bs4.

It's not part of Python's standard library, so we first need to install it with the following command:

pip install beautifulsoup4

After installing beautifulsoup4, we can import the package in our Python script and use its methods.

First, we read the HTML file before parsing it for information. To do this, we pass the file's content to the BeautifulSoup constructor. We typically call the constructor with two arguments, and the syntax is the following:

Variable = BeautifulSoup(contentVariable, 'html.parser')
  • contentVariable is the variable that stores the content of the file.

  • html.parser is the name of the parser that BeautifulSoup should use to parse contentVariable. It refers to the HTML parser that ships with Python's standard library.
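As a minimal sketch of the constructor call described above, the snippet below parses HTML held in a string rather than read from a file. The html_doc content and the variable names are made up for illustration and are not part of the original example.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the content of a file.
html_doc = """<html><head><title>Sample page</title></head>
<body><p>Hello, world!</p></body></html>"""

# The second argument names the parser; 'html.parser' ships with Python.
soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string   # text inside the <title> tag
first_paragraph = soup.p.string  # text inside the first <p> tag
print(page_title)
print(first_paragraph)
```

The same call works unchanged when the string comes from f.read() on an opened HTML file.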

The find() method

After parsing the HTML document, we can use BeautifulSoup's methods to find the tags or attributes we're interested in. For this task, bs4 provides the find() and find_all() methods.

Syntax

The syntax of the find() method is the following:

Variable.find(nameOfTag, attrs={"attributeName": "value"})

Example

The following example will demonstrate the working of the find() method:

main.py
from bs4 import BeautifulSoup

with open('index.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')

# Print the first meta tag found in the document
print(soup.find('meta'))

# Uncomment the line below to print the whole document indented
# print(soup.prettify())
# .text returns the text inside a tag (an empty string for a void tag such as meta)
# print(soup.find('meta').text)

Explanation

The following is a brief explanation of the code above:

  • Line 1: We import the BeautifulSoup class used to parse the HTML document.

  • Lines 3–5: We use Python's built-in open() function to open and read the index.html document, then create a BeautifulSoup object by passing the document's content to the constructor for parsing.

  • Line 8: This line finds the first instance of the meta tag and returns it as a Tag object, which we print to the console. If the document has no such tag, find() returns None.
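To illustrate how find() behaves with the attrs parameter and when nothing matches, here is a small sketch. It uses an inline HTML string instead of index.html, and the markup and variable names are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html_doc = """<html><head>
<meta charset="utf-8">
<meta name="description" content="A sample page">
</head><body><p id="intro">Welcome</p></body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Filter by tag name and attribute value; find() returns the first match.
description = soup.find("meta", attrs={"name": "description"})
description_text = description.get("content")

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("video")
missing_is_none = missing is None

intro_text = soup.find("p", attrs={"id": "intro"}).text
print(description_text, missing_is_none, intro_text)
```

Passing the filter through attrs matters for an attribute called name, because find() already uses name as the keyword for the tag name.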

The find_all() method

The find() method returns only the first tag that matches the given name or attributes, whereas find_all() returns a list of all the matching tags. It also accepts a list of tag names and then matches any tag in that list.

Syntax

The following is the syntax of the find_all() method:

Variable.find_all(listOfNames, attrs={"attributeName": "value"})

By default, find_all() treats a plain string argument as a tag name. To search by attributes instead, we pass a dictionary of attribute names and values through the attrs keyword.
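The two ways of calling find_all() described above can be sketched as follows. The HTML string, class names, and variable names are hypothetical and chosen only to make the matches easy to follow.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html_doc = """<body>
<h1 class="title">Heading</h1>
<p class="note">First</p>
<p>Second</p>
<a href="/home" class="note">Home</a>
</body>"""

soup = BeautifulSoup(html_doc, "html.parser")

# A list of names matches any of the listed tags, in document order.
headings_and_paragraphs = soup.find_all(["h1", "p"])
tag_names = [tag.name for tag in headings_and_paragraphs]

# attrs narrows the search to tags whose attributes match, whatever the tag name.
notes = soup.find_all(attrs={"class": "note"})
note_texts = [tag.text for tag in notes]

print(tag_names)
print(note_texts)
```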

Example

In the following example, we read the index.html file and print all the instances of the meta tag:

main.py
from bs4 import BeautifulSoup

with open('index.html') as f:
    content = f.read()
soup = BeautifulSoup(content, 'html.parser')

# Find every meta tag in the document
all_instances = soup.find_all('meta')

# Print the list using a loop
for i in all_instances:
    print(i)

Explanation

The following is a brief explanation of the code above:

  • Line 1: We import the BeautifulSoup class used to parse the HTML document.

  • Lines 3–5: We use Python's built-in open() function to open and read the index.html document, then create a BeautifulSoup object by passing the document's content to the constructor for parsing.

  • Line 8: This line finds all the instances of the meta tag and returns them as a list, which is stored in all_instances.

  • Lines 10–12: We loop over the list and print each tag on a new line on the console.
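Beyond printing each tag, we often want specific attribute values from the results. The sketch below shows one possible way to do that. The HTML string and variable names are assumptions for illustration, and limit is a find_all() argument that caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html_doc = """<head>
<meta charset="utf-8">
<meta name="author" content="Jane Doe">
<meta name="keywords" content="python, html">
</head>"""

soup = BeautifulSoup(html_doc, "html.parser")

# Collect name/content pairs from every meta tag that has both attributes.
meta_info = {
    tag["name"]: tag["content"]
    for tag in soup.find_all("meta")
    if tag.has_attr("name") and tag.has_attr("content")
}

# limit stops the search after the given number of matches.
first_two = soup.find_all("meta", limit=2)

# With no matches, find_all() returns an empty list rather than None.
no_match = soup.find_all("video")

print(meta_info)
print(len(first_two), len(no_match))
```

Because an empty result is still a list, a loop over it simply does nothing, so no None check is needed as it is with find().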


Copyright ©2024 Educative, Inc. All rights reserved