How to use get_text() in Beautiful Soup

Key takeaways:
get_text() extracts the human-readable text from HTML tags, allowing easy retrieval of content without the surrounding HTML structure.
The separator argument of get_text() defines a custom separator between the text of nested elements, while strip=True removes leading and trailing whitespace from each piece of extracted text.
Always check for the existence of the element before calling get_text() to avoid errors if the element is missing.
Using get_text() with other Beautiful Soup methods like find() or find_all() simplifies text extraction for more effective and structured web scraping.

Beautiful Soup is a Python library used for web scraping and parsing HTML and XML documents. One of its methods is get_text(), which allows us to retrieve human-readable text content from HTML tags.

The get_text() method extracts the textual content of an HTML element. When we scrape web pages, we usually want the text inside specific tags, such as paragraphs, headings, or span elements, rather than the entire HTML structure. In those situations, get_text() returns just the text we need.

Syntax of the `get_text()`

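The get_text() method returns the concatenated text of the element it is called on and all of its descendants, excluding any tags. Called on the BeautifulSoup object itself, it returns the text of the whole parsed page. It accepts the following arguments:

separator (optional): A string to insert between the pieces of text found inside the element. The default is an empty string ("").
strip (optional): If True, leading and trailing whitespace is removed from each piece of text, and pieces that contain only whitespace are dropped. The default is False.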
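
As a rough illustration of the two arguments, the sketch below calls get_text() three ways on the same element. The one-line HTML string is a hypothetical snippet made up for this example, not part of sample.html.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, used only to show the effect of each argument.
html = "<p> Hello, <b> world </b>! </p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Default: text pieces are joined with "" and whitespace is kept as-is.
default_text = p.get_text()

# separator: the given string is placed between the text pieces.
joined_text = p.get_text(separator="|")

# strip=True: each piece is trimmed first, then joined with the separator.
clean_text = p.get_text(separator=" ", strip=True)

print(repr(default_text))
print(repr(joined_text))
print(repr(clean_text))
```

Printing with repr() makes the leading, trailing, and doubled spaces visible, which is where the three calls differ.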

Basic usage of the `get_text()`

To use get_text(), we first create a Beautiful Soup object by parsing the HTML content with a suitable parser, such as html.parser. We then locate the desired element with methods like find(), find_all(), or select(), and call get_text() on it to extract its text content. Since find_all() and select() return lists of elements, we call get_text() on each item.

sample.html

<!DOCTYPE html>
<html>
<head>
    <title>Educative - Learn, Explore, and Grow</title>
</head>
<body>
    <header>
        <h1>Welcome to Educative</h1>
        <nav>
            <ul>
                <li>Courses with Assessments</li>
                <li>Assessments</li>
                <li>Blog</li>
                <li>About Us</li>
            </ul>
        </nav>
    </header>
    <div class='description'>
      Educative provides interactive courses for software developers. <br />We are changing how developers continue their education and stay relevant by providing pre-configured learning environments that adapt to match a developer's skill level.
    </div>
</body>
</html>
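
The steps described above can be sketched as follows. This is a minimal example, not the article's original main.py: it assumes a trimmed copy of sample.html inlined as a string so that it runs on its own, and the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of sample.html, inlined so the sketch is self-contained.
html = """
<html>
<head><title>Educative - Learn, Explore, and Grow</title></head>
<body>
    <header>
        <h1>Welcome to Educative</h1>
        <nav>
            <ul>
                <li>Courses with Assessments</li>
                <li>Assessments</li>
                <li>Blog</li>
                <li>About Us</li>
            </ul>
        </nav>
    </header>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element.
heading_text = soup.find("h1").get_text()

# find_all() returns every match, so get_text() is called on each one.
menu_items = [li.get_text() for li in soup.find_all("li")]

# select() takes a CSS selector and also returns a list of elements.
title_text = soup.select("title")[0].get_text()

print(heading_text)
print(menu_items)
print(title_text)
```

To work with the file itself instead of an inline string, the same BeautifulSoup() call accepts an open file handle in place of the string.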

Dealing with nested elements

HTML documents often have nested elements, where an element contains other elements. To format the retrieved text from nested elements, we can use two arguments:

separator
strip

1. Using `separator`

When an element contains multiple child elements, such as a paragraph with span tags or a list with list items, get_text() by default concatenates the text of all the child elements into one string with nothing between them, which removes any distinction between them. By specifying the separator argument, we can choose the character or string that is placed between the text of each child element. This is particularly useful when we want the output to keep a clear boundary between the text of different elements.
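
A minimal sketch of this behavior, assuming only the navigation list from sample.html inlined as a string. Note that the whitespace between the li tags counts as text too, so strip=True is used here to keep those whitespace-only pieces out of the result.

```python
from bs4 import BeautifulSoup

# The <ul> from sample.html, inlined so the sketch runs on its own.
html = """
<ul>
    <li>Courses with Assessments</li>
    <li>Assessments</li>
    <li>Blog</li>
    <li>About Us</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
menu = soup.find("ul")

# Without a separator, the item texts run together.
run_together = menu.get_text(strip=True)

# With a separator, each item stays distinguishable.
separated = menu.get_text(separator=" | ", strip=True)

# Without strip=True, the newlines and indentation between the tags
# are treated as text pieces and get separators around them as well.
raw = menu.get_text(separator=" | ")

print(run_together)
print(separated)
print(repr(raw))
```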

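
2. Using `strip`

The strip argument controls what happens to whitespace. By default, get_text() keeps the newlines and indentation that surround the text in the HTML source; with strip=True, each piece of text is trimmed and whitespace-only pieces are dropped. The following is a minimal sketch under stated assumptions: it uses a shortened copy of the description div from sample.html, inlined as a string.

```python
from bs4 import BeautifulSoup

# Shortened copy of the description <div> from sample.html.
html = """
<div class='description'>
  Educative provides interactive courses for software developers. <br />We are changing how developers continue their education.
</div>
"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="description")

# Default: the surrounding newlines and indentation are kept.
raw = div.get_text()

# strip=True trims every piece, so the two sentences end up glued together.
stripped = div.get_text(strip=True)

# Combining strip=True with a separator restores a single space between them.
spaced = div.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(stripped))
print(repr(spaced))
```

Because strip=True trims each piece separately, it is usually paired with a separator whenever the element contains more than one piece of text.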

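
Handling missing elements

If find() matches nothing, it returns None, and calling get_text() on None raises an AttributeError. Checking that the element exists first avoids this. The sketch below is one possible way to do that; the helper text_or_default() and the "N/A" fallback are hypothetical names chosen for illustration and are not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup

html = "<div class='description'>Educative provides interactive courses.</div>"
soup = BeautifulSoup(html, "html.parser")


def text_or_default(element, default=""):
    """Return the element's stripped text, or `default` if the element is None.

    Hypothetical helper for illustration; not a Beautiful Soup function.
    """
    if element is None:
        return default
    return element.get_text(strip=True)


# The <div> exists, so its text is returned.
present = text_or_default(soup.find("div", class_="description"))

# There is no <footer>, so find() returns None and the fallback is used.
missing = text_or_default(soup.find("footer"), default="N/A")

print(present)
print(missing)
```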

Conclusion

The get_text() method simplifies extracting text content from HTML elements during web scraping. By understanding its basic usage, handling nested elements with separator and strip, guarding against missing elements, and combining it with other Beautiful Soup methods such as find() and find_all(), we can retrieve exactly the text we need from a web page and pass cleaner data to later analysis and processing.

Frequently asked questions

How to use find in BeautifulSoup?

The find() method locates the first occurrence of an HTML element matching a given tag, class, or other attributes.

What does get_text() do in Python?

The get_text() method extracts all the text content from a BeautifulSoup object or tag, leaving out the HTML tags and returning only the readable text as a single string.

How do I find all links in BeautifulSoup?

Use find_all('a') to get all <a> tags (links) in the document. Example: soup.find_all('a').
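
As a small sketch of how this combines with get_text(), the example below collects the text and the href of each link. The navigation snippet and its URLs are placeholders invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation snippet; the URLs are placeholders.
html = """
<nav>
    <a href="/courses">Courses</a>
    <a href="/blog">Blog</a>
    <a>No link target</a>
</nav>
"""
soup = BeautifulSoup(html, "html.parser")

# get() returns None when the attribute is absent, instead of raising an error.
links = [(a.get_text(strip=True), a.get("href")) for a in soup.find_all("a")]

for text, href in links:
    print(text, href)
```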

How do you find tags with text in BeautifulSoup?

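To find tags by their text, pass a tag name to find_all() together with the string argument (named text in versions before 4.4.0). Passing string on its own returns the matching strings rather than the tags that contain them:

elements = soup.find_all('p', string='target_text')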
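
One detail worth seeing in code: filtering on the text alone returns the matching strings, while adding a tag name returns the tags. The sketch below uses string, the current name of the older text argument, on a hypothetical two-item list.

```python
from bs4 import BeautifulSoup

# Hypothetical two-item list used only for this comparison.
html = "<ul><li>Blog</li><li>About Us</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Text filter only: the result contains the matching strings themselves.
strings = soup.find_all(string="Blog")

# Tag name plus text filter: the result contains the <li> tags.
tags = soup.find_all("li", string="Blog")

print(strings)
print(tags)
# A matched string can still lead back to its tag through .parent.
print(strings[0].parent.name)
```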

How to get text from div class in BeautifulSoup?

To get text from a div with a given class, use find() or find_all() to locate the div, then call get_text():

# soup is an existing BeautifulSoup object
elements = soup.find_all('div', attrs={'class': 'course'})
print("Divs with class 'course':")
for element in elements:
    print(element.get_text())

What is a tag in BeautifulSoup?

A tag is an HTML element (e.g., <div>, <p>, <h1>) represented in Beautiful Soup as a Tag object, which holds the element's name, attributes, text, and child elements.
