Text Preprocessing and Sentiment Analysis
Learn how to process text data using TensorFlow to train a JAX model.
Text vectorization
Next, we use scikit-learn’s TfidfVectorizer class to convert the text data into numerical TF-IDF representations. The class accepts the maximum number of features.
from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.20)

X_train = jnp.array(X_train, dtype=jnp.float16)
X_test = jnp.array(X_test, dtype=jnp.float16)

print("X_train\n", X_train)
print("\nX_test\n", X_test)
In the code above:
- Lines 1–3: We import the required modules: numpy from jax as jnp, TfidfVectorizer from sklearn.feature_extraction.text, and train_test_split from sklearn.model_selection.
- Line 4: We create an instance of the TfidfVectorizer class with max_features set to 10000.
- Line 5: We call the fit_transform() method of the TfidfVectorizer class to convert docs into TF-IDF values. We also call the toarray() method to convert these TF-IDF values into an array and store it in the X variable.
- Line 6: We use the train_test_split() function to split the dataset (X and labels) into train (X_train and y_train) and test (X_test and y_test) datasets.
- Lines 8–9: We convert X_train and X_test into JAX arrays.
- Lines 11–12: We print X_train and X_test.
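Note that docs and labels aren’t defined in the snippet above; they stand for your corpus of texts and their sentiment labels. As a minimal, self-contained sketch, assuming a hypothetical four-document corpus with binary labels, the pipeline runs end to end like this:

from jax import numpy as jnp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus and binary sentiment labels (1 = positive, 0 = negative)
docs = [
    "I loved this movie",
    "A terrible, boring film",
    "Great acting and a great plot",
    "Not worth the time",
]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(docs).toarray()
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25)
print(jnp.array(X_train, dtype=jnp.float16).shape)  # (3, vocabulary_size)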
Next, we use TensorFlow’s TextVectorization layer to convert the text data into integer representations. The layer accepts the following arguments:
- We use standardize to specify how the text data is preprocessed. For example, the lower_and_strip_punctuation option will lowercase the data and remove punctuation.
- We use max_tokens to dictate the maximum size of the vocabulary.
- We use output_mode to determine the output of the vectorization layer. The int setting outputs integers.
- We use output_sequence_length to indicate the maximum length of the output sequence. This ensures that all sequences have the same length.
import tensorflow as tf

max_features = 5000
batch_size = 32
max_len = 512

# Create the vectorization layer and learn its vocabulary from the training text
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize='lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len
)
vectorize_layer.adapt(X_train, batch_size=None)
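To sanity-check the adapted layer, we can pass it a sample sentence (a hypothetical string; the exact token IDs depend on the learned vocabulary):

sample = tf.constant(["This movie was GREAT!"])
tokens = vectorize_layer(sample)
print(tokens.shape)   # (1, 512): one sequence, padded to output_sequence_length
print(tokens[0, :5])  # the first few integer token IDs; 0 is the padding token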
Preparing training and testing data
Next, we apply this layer to the training and testing data.
X_train_padded = vectorize_layer(X_train)
X_test_padded = vectorize_layer(X_test)
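Both outputs are integer tensors of shape (number of examples, max_len). A quick shape check (the row counts here depend on your split) confirms that every sequence was padded or truncated to the same length:

print(X_train_padded.shape)  # (num_train_examples, 512)
print(X_test_padded.shape)   # (num_test_examples, 512)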
Let’s convert the data to a TensorFlow dataset and create a function to fetch the data in batches. We’ll also convert the data to NumPy arrays because JAX expects NumPy or JAX arrays. Here, tfds.as_numpy converts the tf.data.Dataset into an iterable of NumPy arrays.
import tensorflow_datasets as tfds

training_data = tf.data.Dataset.from_tensor_slices((X_train_padded, y_train))
validation_data = tf.data.Dataset.from_tensor_slices((X_test_padded, y_test))
training_data = training_data.batch(batch_size)
validation_data = validation_data.batch(batch_size)

def get_train_batches():
    ds = training_data.prefetch(1)
    return tfds.as_numpy(ds)
In the code above:
- Line 1: We import tensorflow_datasets as tfds.
- Lines 3–4: We call the from_tensor_slices() method of the tf.data.Dataset module to create the TensorFlow Dataset objects for the training and testing datasets. We name the test data as validation data.
- Lines 5–6: We call the batch() method to split both datasets into batches of batch_size examples.
- Lines 8–10: We define the get_train_batches() function, which prefetches one batch at a time and returns the training data as an iterable of NumPy arrays via tfds.as_numpy().