In Natural Language Processing (NLP), tokenization divides a string into a list of tokens. Tokens are useful for finding meaningful patterns in text and for replacing sensitive data elements with non-sensitive placeholders. A token can be a word within a sentence or a sentence within a paragraph.
The sent_tokenize function tokenizes a given text into sentences. In Python, tokenization is provided by the Natural Language Toolkit (NLTK) library, which must be imported into the code before use.
NLTK
With Python 2.x, NLTK can be installed using the command shown below:
pip install nltk
With Python 3.x, NLTK can be installed using the following command:
pip3 install nltk
However, installation is not yet complete. In the Python file, the code below needs to be run:
import nltk
nltk.download()
Upon executing this code, a downloader interface will pop up. Under the Collections tab, select "all" and then click "Download" to finish the installation.
The code below demonstrates how the sent_tokenize function operates.
from nltk.tokenize import sent_tokenize

# input paragraph
sentence = "Hello, Awesome Reader, how are you doing today? The weather is great, and Python is awesome."

# tokens created
tokens = sent_tokenize(sentence)
print(tokens)