NER with Character and Token Embeddings
Learn to implement NER with character and token embeddings.
Nowadays, recurrent models used to solve the NER task are much more sophisticated than a single embedding layer followed by an RNN. They use more advanced recurrent models such as long short-term memory (LSTM) networks and gated recurrent units (GRUs). We’ll set those advanced models aside here and focus on a technique that provides the model with embeddings at multiple scales, enabling it to understand language better: instead of relying only on token embeddings, we also use character embeddings. A token embedding is then generated from the character embeddings by shifting a convolutional window over the characters in the token.
Using convolution to generate token embeddings
A combination of character embeddings and a convolutional kernel can be used to generate token embeddings. The method will be as follows:
Pad each token (e.g., word) to a predefined length.
Look up the character embeddings for the characters in the token from an embedding layer.
Shift a convolutional kernel over the sequence of character embeddings to generate a token embedding.
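To make these steps concrete, here’s a minimal sketch of a character-to-token embedding branch in Keras. The layer choices, the kernel size of 3, the max-pooling step, and names such as char_vocab_size and max_token_length are assumptions made for this illustration, not the exact configuration used later in the lesson:

import tensorflow as tf
from tensorflow.keras import layers

# Assumed hyperparameters for the sketch (not from the lesson).
char_vocab_size = 100    # number of distinct characters (plus padding)
max_token_length = 12    # characters per token; we'll pick this from corpus statistics below
char_embedding_dim = 16
token_embedding_dim = 32

# Input: one token represented as a sequence of character IDs (already padded; step 1).
char_ids = layers.Input(shape=(max_token_length,), dtype="int32")

# Step 2: look up a character embedding for each character in the token.
char_embeddings = layers.Embedding(
    input_dim=char_vocab_size, output_dim=char_embedding_dim
)(char_ids)  # -> (batch, max_token_length, char_embedding_dim)

# Step 3: shift a convolutional kernel over the character embeddings.
conv_out = layers.Conv1D(
    filters=token_embedding_dim, kernel_size=3, padding="same", activation="relu"
)(char_embeddings)  # -> (batch, max_token_length, token_embedding_dim)

# Pool over the character dimension to collapse the per-character outputs
# into a single embedding vector per token.
token_embedding = layers.GlobalMaxPooling1D()(conv_out)  # -> (batch, token_embedding_dim)

char_to_token_model = tf.keras.Model(inputs=char_ids, outputs=token_embedding)
char_to_token_model.summary()

The Conv1D kernel sliding over the character embeddings is the "convolutional window" described above; pooling over the character dimension turns the per-character outputs into one vector per token.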
The very first thing we need to do is analyze the statistics of token lengths (in characters) in our corpus. As we did previously, we can do this with pandas:
vocab_ser = pd.Series(pd.Series(train_sentences).str.split().explode().unique())
vocab_ser.str.len().describe(percentiles=[0.05, 0.95])
In computing vocab_ser, the first part (i.e., pd.Series(train_sentences).str.split()) results in a pandas Series object whose elements are lists of tokens (each token in the sentence is an item of that list). Next, explode() converts the Series of token lists into a Series of tokens by turning each token into a separate item in the Series. Finally, we take only the unique tokens in that Series. We end up with a pandas Series object where each item is a unique token.
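To see what each step produces, here’s a tiny illustration on a made-up two-sentence corpus (the sentences are assumptions for this example, not the lesson’s dataset):

import pandas as pd

# Toy corpus, made up for illustration.
train_sentences = ["the cat sat", "the dog barked"]

tokenized = pd.Series(train_sentences).str.split()
# 0       [the, cat, sat]
# 1    [the, dog, barked]

exploded = tokenized.explode()
# 0       the
# 0       cat
# 0       sat
# 1       the
# 1       dog
# 1    barked

vocab_ser = pd.Series(exploded.unique())
# 0       the
# 1       cat
# 2       sat
# 3       dog
# 4    barked

print(vocab_ser.str.len())
# 0    3
# 1    3
# 2    3
# 3    3
# 4    6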
We’ll now use the str.len() function to get the length of each token (i.e., the number of characters) and look at the 95th percentile of those lengths. We’ll get the following:
count    23623.000000
mean         6.832705
std          2.749288
min          1.000000
5%           3.000000
50%          7.000000
95%         12.000000
max         61.000000
dtype: float64
We can see that around 95% of our words have 12 or fewer characters.
Next, we’ll write a function that pads shorter sentences (and truncates longer ones) to a fixed length:
def prepare_corpus_for_char_embeddings(tokenized_sentences, max_seq_length):
    """ Pads each sequence to a maximum length """
    proc_sentences = []
    for tokens in tokenized_sentences:
        if len(tokens) >= max_seq_length:
            # Truncate sentences that are too long
            proc_sentences.append([[t] for t in tokens[:max_seq_length]])
        else:
            # Pad shorter sentences with empty strings
            proc_sentences.append([[t] for t in tokens + ['']*(max_seq_length - len(tokens))])
    return proc_sentences
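For instance, on a small made-up tokenized corpus with a maximum sequence length of 5, the function pads the shorter sentence with empty strings and truncates the longer one; this is a usage sketch, not the lesson’s data:

# Toy tokenized corpus, assumed for this example.
tokenized_sentences = [
    ["EU", "rejects", "German", "call"],
    ["Peter", "Blackburn", "visited", "Brussels", "last", "Friday", "evening"],
]

padded = prepare_corpus_for_char_embeddings(tokenized_sentences, max_seq_length=5)
print(padded[0])
# [['EU'], ['rejects'], ['German'], ['call'], ['']]              <- padded to length 5
print(padded[1])
# [['Peter'], ['Blackburn'], ['visited'], ['Brussels'], ['last']] <- truncated to length 5

Note that each token is wrapped in its own single-element list in the output.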
The function takes a set of tokenized sentences (i.e., each sentence as a list of tokens, not a string) and a maximum sequence length. Note that this is the maximum sequence length we used previously, not the new token length we discussed. ...