In Natural Language Processing, tokenization divides a string into a list of tokens. Tokens come in handy when finding valuable patterns and when replacing sensitive data components with non-sensitive ones.
Tokens can be thought of as a word in a sentence or a sentence in a paragraph.
word_tokenize is a function in Python that splits a given sentence into words using the NLTK library.
Figure 1 below shows the tokenization of a sentence into words.
In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library.
NLTK
With Python 2.x, NLTK can be installed with the command shown below:
pip install nltk
With Python 3.x, NLTK can be installed with the command shown below:
pip3 install nltk
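To confirm that the library was installed correctly, a quick version check can be run in Python; a minimal sketch (the version printed will depend on what pip installed):
import nltk
# prints the installed NLTK version, e.g., 3.8
print(nltk.__version__)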
However, the installation is not yet complete. In a Python file, the code shown below needs to be run:
import nltk
nltk.download()
Upon executing the code, an interface will pop up. Under the Collections tab, click on “all” and then click on “Download” to finish the installation.
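On a machine without a display (for example, a remote server), the interface cannot open. As an alternative, the specific data that word_tokenize relies on can be fetched directly; a minimal sketch, assuming the Punkt tokenizer models are all this tutorial needs:
import nltk
# downloads only the Punkt tokenizer models used by word_tokenize
# (newer NLTK releases may name this resource 'punkt_tab' instead)
nltk.download('punkt')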
The code below shows how the word_tokenize function operates.
Some special characters, such as commas, are also treated as tokens.
The word_tokenize function is imported from the nltk.tokenize module.
from nltk.tokenize import word_tokenize

data = "Hello, Awesome User"

# tokenization of sentence into words
tokens = word_tokenize(data)

# printing the tokens
print(tokens)
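Running the code prints the list of tokens; note that the comma appears as its own token:
['Hello', ',', 'Awesome', 'User']
For the second case mentioned earlier, a sentence in a paragraph, NLTK provides the sent_tokenize function in the same nltk.tokenize module. A minimal sketch (the example paragraph here is illustrative):
from nltk.tokenize import sent_tokenize

paragraph = "Hello, Awesome User. Welcome to tokenization."

# tokenization of paragraph into sentences
print(sent_tokenize(paragraph))
This prints each sentence as a separate token:
['Hello, Awesome User.', 'Welcome to tokenization.']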