...


Text Preprocessing and Sentiment Analysis

Learn how to process text data using TensorFlow to train a JAX model.

Text vectorization

Next, we use scikit-learn’s TfidfVectorizer class to convert the text data into numerical TF-IDF representations. The vectorizer accepts the maximum number of features to keep. A standalone toy example appears after the walkthrough below.

from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20)
X_train = jnp.array(X_train, dtype=jnp.float16)
X_test = jnp.array(X_test, dtype=jnp.float16)
print("X_train\n", X_train)
print("\nX_test\n", X_test)

In the code above:

  • Lines 1–3: We import the required modules: numpy from jax as jnp, TfidfVectorizer from sklearn.feature_extraction.text, and train_test_split from sklearn.model_selection.

  • Line 4: We create an instance of the TfidfVectorizer class with max_features set to 10000.

  • Line 5: We call the fit_transform() method of the TfidfVectorizer class to convert docs into TF-IDF values. We also call the toarray() method to convert these TF-IDF values into an array, which we store in the X variable.

  • Line 6: We use the train_test_split() function to split the dataset (X and labels) into train (X_train and y_train) and test (X_test and y_test) sets.

  • Lines 7–8: We convert X_train and X_test into JAX arrays.

  • Lines 9–10: We print X_train and X_test.
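
To see what this step produces, here is a minimal, self-contained sketch on a hypothetical two-document corpus (the docs list below is an illustrative stand-in for the lesson’s dataset):

from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the lesson's real docs and labels come from earlier steps
docs = ["the movie was great", "the movie was terrible"]

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()  # shape: (2, vocabulary_size)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(jnp.array(X, dtype=jnp.float16))     # TF-IDF weights as a JAX array

Words shared by both documents, such as "the" and "movie", receive lower TF-IDF weights than the discriminative words "great" and "terrible", which is exactly why TF-IDF features are useful for sentiment classification.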

Next, we use TensorFlow’s TextVectorization layer to convert the text data into integer representations. The layer accepts the following arguments:

  • We use standardize to specify how the text data is preprocessed. For example, the lower_and_strip_punctuation option lowercases the data and strips punctuation.
  • We use max_tokens to dictate the maximum size of the vocabulary.
  • We use output_mode to determine the output of the vectorization layer. The int setting outputs integers.
  • We use output_sequence_length to indicate the maximum length of the output sequence. This ensures that all sequences have the same length.
import tensorflow as tf

max_features = 5000
batch_size = 32
max_len = 512

vectorize_layer = tf.keras.layers.TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len)
vectorize_layer.adapt(X_train, batch_size=None)
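
As a quick sanity check, assuming the layer has been adapted as above, we can inspect the learned vocabulary and vectorize a sample sentence (the sentence below is illustrative):

# Inspect the first few vocabulary entries learned by adapt()
print(vectorize_layer.get_vocabulary()[:10])

# Vectorize a hypothetical sample sentence; the result is a padded
# integer sequence of length max_len (512 here)
sample = tf.constant(["This movie was great!"])
print(vectorize_layer(sample).shape)  # (1, 512)

Indices 0 and 1 are reserved for padding and out-of-vocabulary tokens, respectively, so any word the layer did not see during adapt() maps to 1.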

Preparing training and testing data

Next, we apply this layer to the training and testing data.

X_train_padded = vectorize_layer(X_train)
X_test_padded = vectorize_layer(X_test)

Let’s convert the data into TensorFlow datasets and create a function to fetch the data in batches. We’ll also convert the data to NumPy arrays because JAX expects NumPy or JAX arrays. Here, tfds.as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays. A usage sketch follows the walkthrough below.

import tensorflow_datasets as tfds

training_data = tf.data.Dataset.from_tensor_slices((X_train_padded, y_train))
validation_data = tf.data.Dataset.from_tensor_slices((X_test_padded, y_test))
training_data = training_data.batch(batch_size)
validation_data = validation_data.batch(batch_size)

def get_train_batches():
    ds = training_data.prefetch(1)
    return tfds.as_numpy(ds)

In the code above:

  • Line 1: We import tensorflow_datasets as tfds.

  • Lines 3–4: We call the from_tensor_slices() method of the tf.data.Dataset module to create TensorFlow Dataset objects for the training and testing datasets. We name the test split validation_data.

  • Lines 5–6: We call the batch() method to group the training and validation datasets into batches of batch_size examples.

  • Lines 8–10: We define the get_train_batches() function, which prefetches one batch ahead and returns the training data as an iterable of NumPy arrays using tfds.as_numpy().
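
With the batching function in place, a typical consumer is a training loop that iterates over the NumPy batches and hands each one to a JAX update step. Here is a minimal sketch, assuming a hypothetical update() function (not defined in this lesson) that applies one optimization step to the model parameters:

num_epochs = 10  # illustrative value

for epoch in range(num_epochs):
    for x_batch, y_batch in get_train_batches():
        # x_batch and y_batch arrive as NumPy arrays, ready to be
        # passed into jit-compiled JAX functions without conversion
        params = update(params, x_batch, y_batch)  # hypothetical update step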