What is word_tokenize in Python?

In Natural Language Processing, tokenization divides a string into a list of tokens. Tokens are useful for finding meaningful patterns in text, and they can also serve as a way to replace sensitive data elements with non-sensitive equivalents.

A token can be thought of as a word in a sentence, or as a sentence in a paragraph.

word_tokenize is a function from the NLTK library that splits a given sentence into words.

Figure 1 below shows the tokenization of a sentence into words.

Figure 1: Splitting of a sentence into words.

In Python, we can tokenize with the help of the Natural Language Toolkit (NLTK) library.

Installation of NLTK

With Python 2.x, NLTK can be installed with the command shown below:

pip install nltk

With Python 3.x, NLTK can be installed with the command shown below:

pip3 install nltk
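
To confirm that the installation succeeded, the library version can be printed from Python. This quick check is illustrative and not part of the required setup:

import nltk
# If the import succeeds, NLTK is installed; print its version
print(nltk.__version__)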

However, the installation is not yet complete. The code shown below needs to be run in a Python file or interpreter:

import nltk
# Open the NLTK downloader interface
nltk.download()

Upon executing the code, a downloader interface will pop up. Under the Collections tab, select "all" and then click "Download" to finish the installation.
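
On a machine without a display (such as a remote server), the graphical downloader cannot open. A common alternative, sketched below, is to download only the tokenizer models that word_tokenize depends on; the resource is named "punkt" in most NLTK releases (the newest releases may instead require "punkt_tab"):

import nltk
# Download only the tokenizer models used by word_tokenize
nltk.download('punkt')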

Example

The code below demonstrates how the word_tokenize function operates.

Some special characters, such as commas, are also treated as tokens.

  • In line 1, the word_tokenize function is imported from the nltk.tokenize module.
  • In line 2, the string to be tokenized contains a comma, which will appear in the output as a separate token.
  • In line 4, word_tokenize splits the string into a list of tokens.
  • In line 6, the list of tokens is printed.
from nltk.tokenize import word_tokenize
data = "Hello, Awesome User"
# tokenization of the sentence into words
tokens = word_tokenize(data)
# printing the tokens
print(tokens)
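
Running the code prints the list below; note that the comma appears as a separate token:

['Hello', ',', 'Awesome', 'User']

As mentioned earlier, a token can also be a sentence in a paragraph. The sketch below (the sample paragraph is made up for illustration) uses NLTK's sent_tokenize function, the sentence-level counterpart of word_tokenize, to split a paragraph into sentences:

from nltk.tokenize import sent_tokenize, word_tokenize
paragraph = "Tokenization is useful. It splits text into tokens."
# tokenization of the paragraph into sentences
sentences = sent_tokenize(paragraph)
print(sentences)  # ['Tokenization is useful.', 'It splits text into tokens.']
# each sentence can, in turn, be tokenized into words
print(word_tokenize(sentences[0]))  # ['Tokenization', 'is', 'useful', '.']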