Solution Explanations: Advanced Text Preprocessing
Review solution explanations for the code challenges on advanced text preprocessing.
Solution 1: Part-of-speech tagging
Here’s the solution:
main.py
import pandas as pd
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('averaged_perceptron_tagger', quiet=True)

feedback_df = pd.read_csv('feedback.csv')
feedback_df['tokens'] = feedback_df['feedback'].apply(lambda text: word_tokenize(text.lower()))
stop_words = set(stopwords.words('english'))
feedback_df['tokens'] = feedback_df['tokens'].apply(lambda tokens: [token for token in tokens if token not in stop_words])
feedback_df['tokens'] = feedback_df['tokens'].apply(lambda tokens: [token for token in tokens if token not in string.punctuation])
feedback_df['pos_tags'] = feedback_df['tokens'].apply(nltk.pos_tag)
print(feedback_df['pos_tags'])
Let’s go through the solution explanation:
Line 9: We convert the text in the feedback column to lowercase and tokenize it using the word_tokenize function. We then save the tokenized text as a new tokens column.
Line 10: We create a set of stopwords using stopwords.words('english').
Lines 11–12: ...
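As a rough illustration of the stop-word and punctuation filtering in main.py, the sketch below applies the same two list comprehensions to a hand-written token list, so it runs without NLTK or its data downloads. The STOP_WORDS set and the filter_tokens helper are hypothetical names introduced here, not part of the solution, and the tokens only approximate what a tokenizer would return.

```python
import string

# Hypothetical stand-in for NLTK's English stop-word list (the real list is much longer).
STOP_WORDS = {"the", "is", "but", "it", "a"}


def filter_tokens(tokens):
    """Drop stop words, then punctuation, mirroring the two filters in main.py."""
    without_stop_words = [t for t in tokens if t not in STOP_WORDS]
    # `t not in string.punctuation` is a substring test against one long string,
    # so it removes single marks such as "," or "!" but keeps a token like "...".
    return [t for t in without_stop_words if t not in string.punctuation]


# Hand-written, lowercased tokens resembling tokenizer output for
# "The app is great, but it crashes a lot..."
tokens = ["the", "app", "is", "great", ",", "but", "it", "crashes", "a", "lot", "..."]
filtered = filter_tokens(tokens)
print(filtered)
```

Because string.punctuation is a single string, the membership test only matches tokens that appear in it as a contiguous substring. If multi-character punctuation tokens also need to go, one option is to check that every character of the token is punctuation, for example all(ch in string.punctuation for ch in token).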