Data Preprocessing
Learn how to clean and process data using NLTK before using the LSTM model.
Before designing a model, it's important to process the data that was covered previously.
Text vectorization with scikit-learn
We’ll use scikit-learn’s TfidfVectorizer class to convert the text data to numerical TF-IDF representations. Its max_features argument sets the maximum number of features, so only the most frequent terms in the corpus are kept in the vocabulary.
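As a minimal sketch of what max_features does, the snippet below runs TfidfVectorizer on a small made-up corpus. The names toy_docs and X_toy are hypothetical and are not part of the lesson's dataset; the sketch only assumes that scikit-learn is installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
toy_docs = [
    "good movie good plot",
    "bad movie bad acting",
    "good movie",
]

# Keep only the 3 most frequent terms across the whole corpus.
vectorizer = TfidfVectorizer(max_features=3)

# fit_transform() returns a sparse matrix; toarray() makes it a dense NumPy array.
X_toy = vectorizer.fit_transform(toy_docs).toarray()

# The less frequent terms "plot" and "acting" are dropped from the vocabulary.
print(sorted(vectorizer.vocabulary_))
print(X_toy.shape)  # (3, 3): one row per document, one column per kept term
print(X_toy.dtype)  # float64: TF-IDF weights are floats, not integers
print(X_toy)
```

Each row is one document and each column is one retained term. With the default settings, every row is also scaled to unit length, so the values fall between 0 and 1.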
from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20, random_state=0)
print("X_train\n", X_train)
print("\nX_test\n", X_test)
In the code above:
Lines 1–3: We import the required modules: numpy from jax as jnp, TfidfVectorizer from sklearn.feature_extraction.text, and train_test_split from sklearn.model_selection.
Line 5: We create an instance of the TfidfVectorizer class with max_features set to 10000.
Line 6: We call the fit_transform() method of the TfidfVectorizer instance to convert docs into TF-IDF values. We also call the toarray() method to convert these TF-IDF values from a sparse matrix into a dense array and store it in the X variable.
Lines 7: We ...