clean-text package?

clean-text is a third-party package that preprocesses text data to obtain a normalized text representation.
The package can be installed via pip with the following command:
pip install clean-text
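After installing, it can help to confirm that the module is importable. Note that the pip distribution is named clean-text, while the module imported in the examples in this article is cleantext. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of clean-text.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# The distribution is called clean-text, but the importable module is cleantext.
if is_installed("cleantext"):
    print("cleantext is available")
else:
    print("cleantext is missing; run: pip install clean-text")
```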
clean() method

The clean() method preprocesses the given text according to the options passed to it, for example, by replacing all the URLs in the text with a replacement string.
clean(
text,
fix_unicode=True,
to_ascii=True,
lower=True,
normalize_whitespace=True,
no_line_breaks=False,
strip_lines=True,
keep_two_line_breaks=False,
no_urls=False,
no_emails=False,
no_phone_numbers=False,
no_numbers=False,
no_digits=False,
no_currency_symbols=False,
no_punct=False,
no_emoji=False,
replace_with_url="<URL>",
replace_with_email="<EMAIL>",
replace_with_phone_number="<PHONE>",
replace_with_number="<NUMBER>",
replace_with_digit="0",
replace_with_currency_symbol="<CUR>",
replace_with_punct="",
lang="en",
)
text: This is the text to preprocess.
fix_unicode=True: A boolean value indicating whether to fix broken Unicode characters.
to_ascii=True: If this is True, non-ASCII characters are converted into their closest ASCII equivalents.
lower=True: If this is True, the text is converted to lowercase.
no_line_breaks=False: If this is True, the line breaks are stripped from the text.
no_urls=False: A boolean value indicating whether to replace all the URLs in the text with a special URL token.
no_emails=False: A boolean value indicating whether to replace all the emails in the text with a special EMAIL token.
no_phone_numbers=False: A boolean value indicating whether to replace all the phone numbers in the text with a special PHONE token.
no_numbers=False: A boolean value indicating whether to replace all the numbers in the text with a special NUMBER token.
no_digits=False: A boolean value indicating whether to replace all the digits in the text with a special DIGIT token.
no_currency_symbols=False: A boolean value indicating whether to replace all the currency symbols in the text with a special CURRENCY token.
no_punct=False: A boolean value indicating whether to remove all the punctuation in the text.
replace_with_url="<URL>": This is the special URL token. The default value is <URL>.
replace_with_email="<EMAIL>": This is the special EMAIL token. The default value is <EMAIL>.
replace_with_phone_number="<PHONE>": This is the special PHONE token. The default value is <PHONE>.
replace_with_number="<NUMBER>": This is the special NUMBER token. The default value is <NUMBER>.
replace_with_digit="0": This is the special DIGIT token. The default value is 0.
replace_with_currency_symbol="<CUR>": This is the special CURRENCY token. The default value is <CUR>.
replace_with_punct="": Punctuation is replaced with this string. The default value is an empty string.
lang="en": The language of the text, which determines the type of text preprocessing. The default value is English ("en"). Other than English, only German ("de") is supported.

The method returns the cleaned text, depending on the parameters passed.
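To make the pairing of the no_* flags with the replace_with_* tokens concrete, here is a simplified, standard-library sketch of the idea. It is not clean-text's implementation: the function simple_clean and its regular expressions are hypothetical stand-ins that are far less thorough than the library's own patterns, and the order of the steps is a choice made for this sketch only.

```python
import re

# Deliberately simple patterns for illustration only; clean-text's own
# patterns are more thorough and are not reproduced here.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def simple_clean(
    text,
    lower=True,
    no_urls=False,
    no_emails=False,
    no_numbers=False,
    replace_with_url="<URL>",
    replace_with_email="<EMAIL>",
    replace_with_number="<NUMBER>",
):
    """Hypothetical stand-in that mimics a few clean() options."""
    # Each no_* flag switches on one substitution; the matching
    # replace_with_* argument supplies the token that is written in.
    if no_urls:
        text = URL_RE.sub(replace_with_url, text)
    if no_emails:
        text = EMAIL_RE.sub(replace_with_email, text)
    if no_numbers:
        text = NUMBER_RE.sub(replace_with_number, text)
    # In this sketch, lowercasing runs last, so the tokens are lowercased too.
    if lower:
        text = text.lower()
    # Collapse runs of whitespace, loosely mirroring normalize_whitespace.
    return re.sub(r"\s+", " ", text).strip()


sample = "Visit https://example.com or mail Jane@Example.com before 2024"
print(simple_clean(sample, no_urls=True, no_emails=True, no_numbers=True))
# prints: visit <url> or mail <email> before <number>
```

Keeping each flag separate from its token means a caller can turn a substitution on without caring about the token, or override the token (for example, replace_with_number="N") without touching the rest of the call.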
import cleantext

txt = "Hello Educative!!! How are you?"
new_txt = cleantext.clean(txt, no_punct=True)
print("Original String - '" + txt + "'")
print("Modified String after removing punctuations - '" + new_txt + "'")
We import the cleantext package.
We define a string, txt, with punctuation.
We remove the punctuation from txt using the clean method and passing no_punct as True. The result is stored in new_txt.

import cleantext

txt = "Hello Educative!!! 123 How are you? 456"
new_txt = cleantext.clean(txt, no_numbers=True)
print("Original String - '" + txt + "'")
print("Modified String after replacing numbers - '" + new_txt + "'")
We import the cleantext package.
We define a string, txt, with numbers in it.
We replace the numbers in txt with the special NUMBER token using the clean method and passing no_numbers as True. The result is stored in new_txt.