What is the clean() method of the clean-text package in Python?

What is the `clean-text` package?

clean-text is a third-party package that preprocesses text data to obtain a normalized text representation.

The package can be installed via pip. Use the following command to install the clean-text package:

pip install clean-text
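
Note that the name used with pip (`clean-text`) differs from the module name used in an import statement (`cleantext`). The following is a minimal sketch for checking that the install succeeded, using only the standard library. The helper name `is_installed` is hypothetical and is not part of the package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The package is installed as "clean-text" but imported as "cleantext".
print("cleantext available:", is_installed("cleantext"))
```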

`clean()` method

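The clean() method normalizes the given text according to the options passed. Depending on those options, it can fix broken Unicode, convert characters to ASCII, lowercase the text, and replace URLs, emails, phone numbers, numbers, digits, and currency symbols with replacement strings.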

Method signature

clean(
    text,
    fix_unicode=True,
    to_ascii=True,
    lower=True,
    normalize_whitespace=True,
    no_line_breaks=False,
    strip_lines=True,
    keep_two_line_breaks=False,
    no_urls=False,
    no_emails=False,
    no_phone_numbers=False,
    no_numbers=False,
    no_digits=False,
    no_currency_symbols=False,
    no_punct=False,
    no_emoji=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_phone_number="<PHONE>",
    replace_with_number="<NUMBER>",
    replace_with_digit="0",
    replace_with_currency_symbol="<CUR>",
    replace_with_punct="",
    lang="en",
)

Parameters

text: This is the text to preprocess.
fix_unicode=True: A boolean value indicating whether to fix broken Unicode.
to_ascii=True: If this is True, non-ASCII characters are converted into their closest ASCII equivalents.
lower=True: If this is True, it converts the text to lowercase.
no_line_breaks=False: If this is True, it strips the line breaks from the text.
no_urls=False: This is a boolean value indicating whether to replace all the URLs in the text with a special URL token.
no_emails=False: This is a boolean value that indicates whether to replace all emails in the text with a special EMAIL token.
no_phone_numbers=False: This is a boolean value indicating whether to replace all the phone numbers in the text with a special PHONE token.
no_numbers=False: This is a boolean value indicating whether to replace all the numbers in the text with a special NUMBER token.
no_digits=False: This is a boolean value indicating whether to replace all the digits in the text with a special DIGIT token.
no_currency_symbols=False: This is a boolean value indicating whether to replace all the currency symbols in the text with a special CURRENCY token.
no_punct=False: This is a boolean value indicating whether to remove all the punctuation in the text.
replace_with_url="<URL>": This is the special URL token. The default value is <URL>.
replace_with_email="<EMAIL>": This is the special EMAIL token. The default value is <EMAIL>.
replace_with_phone_number="<PHONE>": This is the special PHONE token. The default value is <PHONE>.
replace_with_number="<NUMBER>": This is the special NUMBER token. The default value is <NUMBER>.
replace_with_digit="0": This is the special DIGIT token. The default value is 0.
replace_with_currency_symbol="<CUR>": This is the special CURRENCY token. The default value is <CUR>.
replace_with_punct="": This string replaces punctuation when no_punct is True. The default value is an empty string, which removes the punctuation.
lang="en": The language of the text, which determines the language-specific preprocessing. The default value is English (‘en’). Besides English, only German (‘de’) is supported.
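
Each `no_*` flag above is paired with a `replace_with_*` token: the flag switches the replacement on, and the token is the string that gets substituted. The following is an illustrative sketch of that pairing using only the standard library. It is not the library's implementation: the regular expressions are deliberately simplified assumptions, and the function name `replace_tokens` is hypothetical.

```python
import re

# Deliberately simplified patterns, for illustration only.
URL_PATTERN = re.compile(r"https?://\S+")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def replace_tokens(text, no_urls=False, no_emails=False,
                   replace_with_url="<URL>", replace_with_email="<EMAIL>"):
    """Replace URLs and emails with tokens when the matching flag is set."""
    if no_urls:
        text = URL_PATTERN.sub(replace_with_url, text)
    if no_emails:
        text = EMAIL_PATTERN.sub(replace_with_email, text)
    return text


sample = "Write to jane@example.com or see https://example.com/docs for details"
print(replace_tokens(sample, no_urls=True, no_emails=True))
# prints: Write to <EMAIL> or see <URL> for details
```

With both flags left at False, the text passes through unchanged, which mirrors the defaults in the signature.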

Return value

The method returns the cleaned text as a string, processed according to the parameters passed.

Code example 1
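
A minimal usage sketch, assuming `clean-text` is installed. The helper names `url_email_options` and `clean_message` are made up for illustration; the keyword arguments are taken from the method signature above.

```python
def url_email_options():
    """Keyword arguments that ask clean() to mask URLs and email addresses."""
    return {
        "no_urls": True,
        "no_emails": True,
        "replace_with_url": "<URL>",
        "replace_with_email": "<EMAIL>",
    }


def clean_message(text):
    """Run clean() on text with the options above; needs clean-text installed."""
    # Imported lazily so this module still loads where clean-text is absent.
    from cleantext import clean

    return clean(text, **url_email_options())
```

Calling `clean_message("Mail jane@example.com or visit https://example.com")` returns the cleaned string. Since `lower` defaults to True, the returned text is also lowercased.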

License: Creative Commons-Attribution-ShareAlike 4.0 (CC-BY-SA 4.0)