How to use get_text() in Beautiful Soup

Key takeaways:

  • get_text() extracts the human-readable text from HTML tags, allowing easy retrieval of content without the surrounding HTML structure.

  • The separator argument of get_text() defines the string inserted between the text of nested elements, while strip=True strips leading and trailing whitespace from each piece of extracted text.

  • Always check for the existence of the element before calling get_text() to avoid errors if the element is missing.

  • Using get_text() with other Beautiful Soup methods like find() or find_all() simplifies text extraction for more effective and structured web scraping.

Beautiful Soup is a Python library used for web scraping and parsing HTML and XML documents. One of its methods is get_text(), which allows us to retrieve human-readable text content from HTML tags.

The get_text() method extracts the textual content of an HTML element. When we scrape web pages, we usually want the text inside specific tags, such as paragraphs, headings, or span elements, rather than the surrounding markup, and get_text() returns just that text.

Syntax of get_text()

The get_text() method returns the text of a tag and all of its descendants, concatenated into a single string with the tags themselves removed. Called on the BeautifulSoup object itself, it returns the text of the whole parsed document.

tag.get_text(separator="", strip=False)
  • separator (optional): A string inserted between adjacent pieces of text. The default is an empty string ("").

  • strip (optional): If True, leading and trailing whitespace is stripped from each piece of text, and pieces that are empty after stripping are dropped. The default is False.
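
As a small illustration of the two parameters, the sketch below calls get_text() four ways on the same tag. The HTML snippet and variable names are made up for this example and are not part of the article's sample page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, chosen so the whitespace around each piece is visible
html = "<p> Hello <b>world</b> </p>"
tag = BeautifulSoup(html, "html.parser").p

default_text = tag.get_text()                         # pieces joined with ""
piped_text = tag.get_text(separator="|")              # "|" between pieces
stripped_text = tag.get_text(strip=True)              # each piece stripped
clean_text = tag.get_text(separator=" ", strip=True)  # stripped, then joined

print(repr(default_text))   # ' Hello world '
print(repr(piped_text))     # ' Hello |world| '
print(repr(stripped_text))  # 'Helloworld'
print(repr(clean_text))     # 'Hello world'
```

Note that strip=True on its own glues the pieces together, which is why it is often paired with a separator.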

Basic usage of get_text()

To use get_text(), we first create a BeautifulSoup object by parsing the HTML content with a suitable parser, such as html.parser. We can then locate the desired element with methods like find(), find_all(), or select(), and call get_text() on it to extract its text content. Note that find_all() and select() return a list of elements, so get_text() is called on each item.

sample.html
<!DOCTYPE html>
<html>
<head>
<title>Educative - Learn, Explore, and Grow</title>
</head>
<body>
<header>
<h1>Welcome to Educative</h1>
<nav>
<ul>
<li>Courses with Assessments</li>
<li>Assessments</li>
<li>Blog</li>
<li>About Us</li>
</ul>
</nav>
</header>
<div class='description'>
Educative provides interactive courses for software developers. <br />We are changing how developers continue their education and stay relevant by providing pre-configured learning environments that adapt to match a developer's skill level.
</div>
</body>
</html>
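
The sample page is shown here without a matching script, so the following is only a minimal sketch of basic usage. It assumes a shortened copy of the markup inlined as a string (so the example runs on its own); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page, inlined so the sketch is self-contained
html_content = """
<html>
<head><title>Educative - Learn, Explore, and Grow</title></head>
<body>
<header><h1>Welcome to Educative</h1></header>
<div class='description'>Educative provides interactive courses for software developers.</div>
</body>
</html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Locate single elements with find(), then extract their text
heading_text = soup.find("h1").get_text()
description_text = soup.find("div", class_="description").get_text()
title_text = soup.find("title").get_text()

print(heading_text)      # Welcome to Educative
print(description_text)  # Educative provides interactive courses for software developers.
print(title_text)        # Educative - Learn, Explore, and Grow
```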

Dealing with nested elements

HTML documents often have nested elements, where an element contains other elements. To format the retrieved text from nested elements, we can use two arguments:

  • separator

  • strip

1. Using separator

When an element contains multiple child elements, such as a paragraph with span tags or a list with list items, get_text() by default concatenates the text of all the child elements into one string, which removes any distinction between them. By specifying the separator argument, we control which character or string is placed between the text of each child element. This is particularly useful when we want the output to reflect the structure of the original HTML and keep the text of different elements clearly apart.

main.py
from bs4 import BeautifulSoup

# Load the sample page (assumes sample.html is in the same directory as this script)
with open('sample.html') as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')
# Find the desired element
element = soup.find('body')
# Extract its text, inserting ' | ' between adjacent pieces of text
text_content = element.get_text(separator=' | ')
print("Text: \n", text_content)

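In the code above, the separator is inserted between every pair of adjacent text pieces, so it appears wherever a tag, including <br />, splits the text. Because the whitespace between tags also counts as text, the output contains separators around otherwise empty lines.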
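
To see the effect of a <br /> in isolation, here is a minimal sketch on a made-up one-line div (not the article's sample page). The tag contributes no text of its own; it only splits the surrounding text into two pieces.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a single <br /> splitting the text into two pieces
html = "<div>First line.<br />Second line.</div>"
div = BeautifulSoup(html, "html.parser").find("div")

joined_default = div.get_text()                   # pieces run together
joined_newline = div.get_text(separator="\n")     # one piece per line

print(repr(joined_default))  # 'First line.Second line.'
print(repr(joined_newline))  # 'First line.\nSecond line.'
```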

2. Using strip

The strip argument of get_text() controls how leading and trailing whitespace, including newline characters, is handled. By default, strip is False. Setting it to True strips each piece of text individually and drops pieces that contain only whitespace, which cleans up unnecessary spaces and line breaks. Because the remaining pieces are then joined directly, strip=True is usually combined with a separator.
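
A minimal sketch of strip in action, assuming a small list modeled on the navigation menu of the sample page (the markup and variable names here are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical list, indented the way hand-written HTML usually is
html = """
<ul>
  <li>Courses</li>
  <li>Assessments</li>
  <li>Blog</li>
</ul>
"""
menu = BeautifulSoup(html, "html.parser").find("ul")

raw = menu.get_text()                                # keeps newlines and indentation
stripped = menu.get_text(strip=True)                 # whitespace-only pieces dropped
labelled = menu.get_text(separator=" | ", strip=True)

print(repr(raw))
print(stripped)   # CoursesAssessmentsBlog
print(labelled)   # Courses | Assessments | Blog
```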


Handling missing elements

When using get_text(), we must consider that the desired element may not exist in the HTML document. In that case find() returns None, and calling get_text() on None raises an AttributeError. To avoid the error, we check that the element exists before extracting its text.
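
One way to write that check is sketched below. The helper name text_or_default and its default argument are hypothetical, introduced only for this example; they are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>Welcome to Educative</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")


def text_or_default(soup, name, default=""):
    """Return the text of the first <name> tag, or default if none exists."""
    element = soup.find(name)  # find() returns None when nothing matches
    if element is None:
        return default
    return element.get_text(strip=True)


heading_text = text_or_default(soup, "h1")
footer_text = text_or_default(soup, "footer", default="(no footer)")

print(heading_text)  # Welcome to Educative
print(footer_text)   # (no footer)
```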



Conclusion

The get_text() method simplifies extracting text content from HTML elements during web scraping. By understanding its basic usage, handling nested elements with separator and strip, guarding against missing elements, and combining it with other Beautiful Soup methods, we can retrieve exactly the text we need from a web page and pass cleaner data to later analysis and processing.

Frequently asked questions



How to use find in BeautifulSoup?

The find() method locates the first occurrence of an HTML element matching a given tag, class, or other attributes.



What does get_text() do in Python?

The get_text() method extracts all the text content from a BeautifulSoup object or tag, stripping away HTML tags and returning only the readable text.


How do I find all links in BeautifulSoup?

Use find_all('a') to get all <a> tags (links) in the document. Example: soup.find_all('a').
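
As a rough sketch, the link text and target of each <a> tag can be collected like this. The markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links
html = '<p><a href="/courses">Courses</a> and <a href="/blog">Blog</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Pair each link's visible text with its href attribute
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

print(links)  # [('Courses', '/courses'), ('Blog', '/blog')]
```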



How do you find tags with text in BeautifulSoup?

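To find tags with text, use the find_all() method with the string argument (called text in older Beautiful Soup releases). Used on its own, it returns the matching text strings rather than the tags that contain them:

elements = soup.find_all(string='target_text')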
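
A search by text alone yields string objects, so an extra step is needed to reach the enclosing tags. The sketch below shows two options on a made-up list; it uses the string keyword, which current Beautiful Soup 4 releases accept for this kind of search.

```python
from bs4 import BeautifulSoup

# Hypothetical markup
html = "<ul><li>Blog</li><li>About Us</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Option 1: match the strings, then step up to the enclosing tag
matches = soup.find_all(string="Blog")
parent_names = [match.parent.name for match in matches]

# Option 2: pass a tag name too, so find_all() returns the tags directly
tags = soup.find_all("li", string="Blog")

print(parent_names)                      # ['li']
print([tag.get_text() for tag in tags])  # ['Blog']
```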

How to get text from div class in BeautifulSoup?

To get the text of a div with a given class, use find() or find_all() to locate the div, then call get_text():

# Assumes 'soup' is an existing BeautifulSoup object
elements = soup.find_all('div', attrs={'class': 'course'})
print("Divs with class 'course':")
for element in elements:
    print(element.get_text())

What is a tag in BeautifulSoup?

A tag is an HTML element (e.g., <div>, <p>, <h1>) represented in Beautiful Soup as a Tag object, which exposes the element's name, attributes, text, and child elements.



Copyright ©2025 Educative, Inc. All rights reserved