...


NER with Character and Token Embeddings


Learn to implement NER with character and token embeddings.

Nowadays, recurrent models used to solve the NER task are much more sophisticated than a single embedding layer followed by an RNN. They use more advanced recurrent models such as long short-term memory (LSTM) networks and gated recurrent units (GRUs). We'll set the discussion about these advanced models aside. Here, we'll focus on a technique that provides the model with embeddings at multiple scales, enabling it to understand language better: instead of relying only on token embeddings, we also use character embeddings. A token embedding is then generated from the character embeddings by shifting a convolutional window over the characters in the token.

Using convolution to generate token embeddings

A combination of character embeddings and a convolutional kernel can be used to generate token embeddings. The method is as follows (a minimal sketch appears after the figure below):

  1. Pad each token (e.g., word) to a predefined length.

  2. Look up the character embeddings for the characters in the token from an embedding layer.

  3. Shift a convolutional kernel over the sequence of character embeddings to generate a token embedding.

Figure: How token embeddings are generated using character embeddings and the convolution operation
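To make steps 2 and 3 concrete, here is a minimal sketch of how a single token's embedding could be computed from its character IDs, assuming a TensorFlow/Keras setup. This is not the lesson's final model; the vocabulary size, embedding dimensions, kernel size, and pooling choice below are illustrative assumptions.

import tensorflow as tf

# Illustrative (assumed) hyperparameters, not values defined in this lesson
n_chars = 100             # size of the character vocabulary
char_embedding_dim = 16   # dimensionality of each character embedding
max_token_length = 12     # characters per (padded/truncated) token
token_embedding_dim = 32  # dimensionality of the resulting token embedding

# Input: one token represented as a sequence of character IDs
char_ids = tf.keras.layers.Input(shape=(max_token_length,), dtype='int32')

# Step 2: look up a character embedding for every character in the token
char_embeddings = tf.keras.layers.Embedding(
    input_dim=n_chars, output_dim=char_embedding_dim
)(char_ids)  # -> (batch, max_token_length, char_embedding_dim)

# Step 3: shift a convolutional kernel over the character embeddings, then
# pool over the character positions to obtain a single token embedding
conv_out = tf.keras.layers.Conv1D(
    filters=token_embedding_dim, kernel_size=3,
    padding='same', activation='relu'
)(char_embeddings)  # -> (batch, max_token_length, token_embedding_dim)
token_embedding = tf.keras.layers.GlobalMaxPooling1D()(conv_out)  # -> (batch, token_embedding_dim)

model = tf.keras.Model(inputs=char_ids, outputs=token_embedding)
model.summary()

In a full NER model, this computation would be applied to every token in a sentence (e.g., via a TimeDistributed wrapper), and the resulting token embeddings could be combined with regular token embeddings before the recurrent layer.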

The very first thing we need to do is analyze the statistics around how many characters there are for a token in our corpus. Similar to how we did it previously, we can do this with pandas:

vocab_ser = pd.Series(
    pd.Series(train_sentences).str.split().explode().unique()
)
vocab_ser.str.len().describe(percentiles=[0.05, 0.95])

In computing vocab_ser, the first part (i.e., pd.Series(train_sentences).str.split()) results in a pandas Series object whose elements are lists of tokens (each token in a sentence is an item of that list). Next, explode() converts this Series of token lists into a Series of individual tokens by turning each token into a separate item in the Series. Finally, we take only the unique tokens in that Series. We end up with a pandas Series object where each item is a unique token.
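For instance, here is what each step of that chain produces on a tiny, made-up corpus (the sentences below are hypothetical stand-ins for train_sentences):

import pandas as pd

# Hypothetical mini-corpus, used only to illustrate the chain above
train_sentences = [
    "EU rejects German call",
    "Peter Blackburn",
    "BRUSSELS 1996-08-22",
]

tokens = pd.Series(train_sentences).str.split().explode()  # one token per row
vocab_ser = pd.Series(tokens.unique())                     # unique tokens only
print(vocab_ser.tolist())
# ['EU', 'rejects', 'German', 'call', 'Peter', 'Blackburn', 'BRUSSELS', '1996-08-22']
print(vocab_ser.str.len().describe(percentiles=[0.05, 0.95]))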

We’ll now use the str.len() function to get the length of each token (i.e., the number of characters) and look at the 95th percentile. We get the following:

count 23623.000000
mean 6.832705
std 2.749288
min 1.000000
5% 3.000000
50% 7.000000
95% 12.000000
max 61.000000
dtype: float64

We can see that around 95% of our words have 12 or fewer characters.

Next, we’ll write a function to pad shorter sentences (and truncate longer ones) to a fixed length:

def prepare_corpus_for_char_embeddings(tokenized_sentences, max_seq_length):
    """ Pads or truncates each tokenized sentence to a maximum length """
    proc_sentences = []
    for tokens in tokenized_sentences:
        if len(tokens) >= max_seq_length:
            # Truncate longer sentences to the first max_seq_length tokens
            proc_sentences.append([[t] for t in tokens[:max_seq_length]])
        else:
            # Pad shorter sentences with empty-string tokens
            proc_sentences.append([[t] for t in tokens + [''] * (max_seq_length - len(tokens))])
    return proc_sentences

The function takes a set of tokenized sentences (i.e., each sentence as a list of tokens, not a string) and a maximum sequence length. Note that this is the maximum sequence length we used previously, not the new token length we discussed. ...
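As a quick illustration, here is how the function behaves on two hypothetical tokenized sentences (the max_seq_length of 5 is arbitrary and chosen only for this example):

# Hypothetical inputs, used only to illustrate the padding/truncation behavior
tokenized_sentences = [
    ["EU", "rejects", "German", "call"],                            # shorter than max_seq_length
    ["Peter", "Blackburn", "visited", "Brussels", "last", "week"],  # longer than max_seq_length
]

proc = prepare_corpus_for_char_embeddings(tokenized_sentences, max_seq_length=5)
print(proc[0])  # [['EU'], ['rejects'], ['German'], ['call'], ['']] -- padded with ''
print(proc[1])  # [['Peter'], ['Blackburn'], ['visited'], ['Brussels'], ['last']] -- truncated

Note that each token ends up wrapped in its own single-element list.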