Data Preprocessing

Learn how to clean and process data using NLTK before passing it to the LSTM model.

Before designing a model, we need to process the data that was covered previously.

Text vectorization with Keras

We’ll use scikit-learn’s TfidfVectorizer class to convert the text data to numerical TF-IDF representations. We pass it max_features, the maximum number of vocabulary terms to keep.
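As a minimal sketch of what this step produces, the snippet below runs TfidfVectorizer on a hypothetical three-sentence corpus. The names toy_docs, toy_vectorizer, and toy_X are illustrative assumptions and are not part of the lesson's dataset. It suggests two things worth keeping in mind: max_features caps the number of columns in the output, and the values are floating-point TF-IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the lesson's `docs`.
toy_docs = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great plot",
]

# Keep only the 5 most frequent terms across the whole corpus.
toy_vectorizer = TfidfVectorizer(max_features=5)

# fit_transform returns a sparse matrix; toarray() makes it a dense NumPy array.
toy_X = toy_vectorizer.fit_transform(toy_docs).toarray()

print(toy_X.shape)  # one row per document, one column per kept term
print(toy_X.dtype)  # floating-point TF-IDF weights
print(sorted(toy_vectorizer.vocabulary_))  # the terms that survived the cap
```

With the default settings, each row is scaled to unit length, so documents of different lengths end up on a comparable scale.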

from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.20, random_state=0)
print("X_train\n", X_train)
print("\nX_test\n", X_test)

In the code above:

  • Lines 1–3: We import the required modules: numpy from jax as jnp, TfidfVectorizer from sklearn.feature_extraction.text, and train_test_split from sklearn.model_selection.

  • Line 5: We create an instance of the TfidfVectorizer class with max_features set to 10000.

  • Line 6: We call the fit_transform() method of the TfidfVectorizer instance to convert docs into a sparse matrix of TF-IDF values. We then call the toarray() method to convert this matrix into a dense array and store it in the X variable.

  • Lines 7: We ...