BERT Models for Italian and Portuguese
Learn about the architecture and variants of the UmBERTo and BERTimbau models.
UmBERTo for Italian
UmBERTo is a pre-trained BERT model for the Italian language from Musixmatch research. The UmBERTo model inherits the RoBERTa architecture. RoBERTa is essentially BERT with the following changes in pre-training:
Dynamic masking is used instead of static masking in the MLM task (see the sketch after this list).
The NSP task is removed and trained using only the MLM task.
Training is undertaken with a large batch size.
Byte-level BPE is used as a tokenizer.
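To make the first change concrete, here is a minimal sketch of BERT-style masking applied on the fly. This is illustrative plain Python, not code from the RoBERTa or UmBERTo training pipelines: because the masked positions are re-sampled every time an example is fetched, each epoch sees a different masking pattern (dynamic masking), whereas static masking fixes the pattern once during preprocessing.

```python
import random

def dynamic_mask(token_ids, mask_id, vocab_size, mask_prob=0.15):
    # Illustrative sketch: re-sample the masked positions on every call,
    # so each training epoch sees a different masking pattern.
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)   # -100: position ignored by the MLM loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                              # predict the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id                      # 80%: replace with the mask token
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size) # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return inputs, labels

# Two calls on the same token IDs produce different masking patterns:
print(dynamic_mask([5, 17, 42, 8], mask_id=4, vocab_size=32000))
print(dynamic_mask([5, 17, 42, 8], mask_id=4, vocab_size=32000))
```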
UmBERTo extends the RoBERTa architecture by using the SentencePiece tokenizer and whole word masking (WWM).
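We can inspect the SentencePiece subword segmentation produced by the UmBERTo tokenizer. The following sketch assumes the publicly released checkpoint ID Musixmatch/umberto-commoncrawl-cased-v1 on the Hugging Face Hub; the exact token strings in the comment are illustrative and may differ.

```python
from transformers import AutoTokenizer

# Assumes the released checkpoint ID on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

tokens = tokenizer.tokenize("Umberto Eco è stato un grande scrittore")
print(tokens)
# SentencePiece marks the start of each word with "▁",
# e.g. ['▁Umberto', '▁Eco', '▁è', '▁stato', '▁un', '▁grande', '▁scrittore']
```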
Variants of the UmBERTo model
Researchers have released two pre-trained UmBERTo models:
umberto-wikipedia-uncased-v1: Trained on the Italian Wikipedia corpus.
umberto-commoncrawl-cased-v1: Trained on the CommonCrawl dataset.
The pre-trained UmBERTo models can be downloaded from GitHub. We can also use a pre-trained UmBERTo model with the transformers library, as shown here:
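The snippet below is a minimal sketch that assumes the commoncrawl variant's Hugging Face checkpoint ID; the Wikipedia variant works the same way. It loads the tokenizer and model and extracts contextual embeddings for an Italian sentence.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumes the released checkpoint ID on the Hugging Face Hub.
model_id = "Musixmatch/umberto-commoncrawl-cased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentence = "UmBERTo è un modello linguistico per la lingua italiana."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embedding of every token: [batch_size, sequence_length, hidden_size]
print(outputs.last_hidden_state.shape)
```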