Data Preprocessing
Learn how to clean and process data using NLTK before feeding it to the LSTM model.
Before designing a model, we need to process the data that was introduced previously.
Text vectorization with Keras
We’ll use scikit-learn’s TfidfVectorizer class to convert the text data to numerical TF-IDF representations. Its max_features parameter sets the maximum number of features (vocabulary terms) to keep.
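To make the effect of max_features concrete, here is a minimal sketch on a tiny made-up corpus. The names toy_docs, toy_vectorizer, and toy_X are hypothetical and used only for illustration; they are not part of the lesson's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, for illustration only (not the lesson's data).
toy_docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great plot",
]

# Keep at most the 5 most frequent terms across the whole corpus.
toy_vectorizer = TfidfVectorizer(max_features=5)

# fit_transform() returns a sparse matrix; toarray() makes it dense.
toy_X = toy_vectorizer.fit_transform(toy_docs).toarray()

print(sorted(toy_vectorizer.vocabulary_))  # the terms that were kept
print(toy_X.shape)  # (number of documents, number of kept terms)
print(toy_X.dtype)  # floating-point TF-IDF weights, not integer IDs
```

Each row of toy_X describes one document and each column one retained term, so the array's width is capped by max_features no matter how large the raw vocabulary is. The lesson's code applies the same class to the full dataset with a cap of 10000.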
from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.20, random_state=0)
print("X_train\n", X_train)
print("\nX_test\n", X_test)
In the code above:
Lines 1–3: We import the required modules: numpy from jax as jnp, TfidfVectorizer from sklearn.feature_extraction.text, and train_test_split from sklearn.model_selection.
Line 5: We create an instance of the TfidfVectorizer class with max_features set to 10000.
Line 6: We call the fit_transform() method of the TfidfVectorizer class to convert the docs into TF-IDF values. We also call the toarray() method to convert these TF-IDF values into a dense array and store it in the X variable.
Line 7: We ...