...


Generating Data for GloVe

Learn to generate data for GloVe embeddings.


We’ll be using the BBC news articles dataset. It contains 2,225 news articles published on the BBC website between 2004 and 2005, each belonging to one of five topics: business, entertainment, politics, sports, and tech.
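
To feed the data generator, each article first needs to be converted into a list of word IDs. Here is a minimal sketch using the Keras Tokenizer; news_texts is a hypothetical placeholder for the loaded BBC articles:

from tensorflow.keras.preprocessing.text import Tokenizer

# news_texts is a hypothetical stand-in for the loaded BBC articles
news_texts = [
    "uk economy grows faster than expected",
    "new film tops box office this weekend",
]

tokenizer = Tokenizer()                               # builds the word -> ID mapping
tokenizer.fit_on_texts(news_texts)                    # learn the vocabulary from the corpus
sequences = tokenizer.texts_to_sequences(news_texts)  # each article becomes a list of word IDs

print(sequences)  # e.g., [[1, 2, 3, 4, 5, 6], [7, 8, 9, 10, 11, 12, 13]]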

The glove_data_generator() function

Let’s now generate the data. We’ll encapsulate the data generation in a function called glove_data_generator(). As the first step, let’s write the function signature:

def glove_data_generator(sequences, window_size, batch_size, vocab_size, cooccurrence_matrix, x_max=100.0, alpha=0.75, seed=None):

The function takes several arguments:

  • sequences (List[List[int]]): This is a list of lists of word IDs, as produced by the tokenizer’s texts_to_sequences() function.
  • window_size (int): This is the window size for the context.
  • batch_size (int): This is the batch size.
  • vocab_size (int): This is the vocabulary size.
  • cooccurrence_matrix (scipy.sparse.lil_matrix): This is a sparse matrix containing co-occurrences of words.
  • x_max (float): This is a hyperparameter used by GloVe to compute sample weights.
  • alpha (float): This is a hyperparameter used by GloVe to compute sample weights; see the sketch after this list.
  • seed: This is the random seed.
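
To make the role of x_max and alpha concrete, here is a small sketch (not the lesson’s implementation) of the GloVe weighting function f(X_ij) = min((X_ij / x_max)^alpha, 1) that these two hyperparameters control; the helper name compute_sample_weights() is hypothetical:

import numpy as np

def compute_sample_weights(cooc_counts, x_max=100.0, alpha=0.75):
    # f(X_ij) = (X_ij / x_max)^alpha, capped at 1 so very frequent pairs don't dominate
    weights = np.power(np.asarray(cooc_counts, dtype=np.float64) / x_max, alpha)
    return np.minimum(weights, 1.0)

print(compute_sample_weights([1.0, 50.0, 250.0]))  # approximately [0.0316 0.5946 1.0]

With the default x_max=100.0, any word pair that co-occurs 100 or more times receives the maximum weight of 1.0.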

It also has several outputs:

  • A batch of (target, context) word ID tuples.
  • The corresponding log(X_ij)
...