🤧belabBERT: Repository for a new Dutch language model based on the RoBERTa architecture
Note: the current release of this model is not yet fully trained; the fully trained version will be released later this month.
A new Dutch RoBERTa-based language model, pretrained on the unshuffled Dutch OSCAR corpus using the masked language modeling (MLM) objective.
The model is case sensitive and includes punctuation. The Hugging Face 🤗 Transformers library was used for the pretraining process.
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='jwouts/belabBERT_115k', tokenizer='jwouts/belabBERT_115k')
>>> unmasker("Hoi ik ben een <mask> model.")
[{'sequence': '<s>Hoi ik ben een dames model.</s>',
'score': 0.05529128015041351,
'token': 3079,
'token_str': 'Ġdames'},
{'sequence': '<s>Hoi ik ben een kleding model.</s>',
'score': 0.042242035269737244,
'token': 3333,
'token_str': 'Ġkleding'},
{'sequence': '<s>Hoi ik ben een mode model.</s>',
'score': 0.04132745787501335,
'token': 6541,
'token_str': 'Ġmode'},
{'sequence': '<s>Hoi ik ben een horloge model.</s>',
'score': 0.029257522895932198,
'token': 7196,
'token_str': 'Ġhorloge'},
{'sequence': '<s>Hoi ik ben een sportief model.</s>',
'score': 0.028365155681967735,
'token': 15357,
'token_str': 'Ġsportief'}]
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')
model = RobertaModel.from_pretrained('jwouts/belabBERT_115k')

text = "Vervang deze tekst."  # Dutch for "Replace this text."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)  # output[0] contains the contextual token embeddings
and in TensorFlow:
from transformers import RobertaTokenizer, TFRobertaModel

tokenizer = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')
model = TFRobertaModel.from_pretrained('jwouts/belabBERT_115k')

text = "Vervang deze tekst."  # Dutch for "Replace this text."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)  # output[0] contains the contextual token embeddings
belabBERT was pretrained on the Dutch portion of the unshuffled OSCAR corpus; the current state-of-the-art Dutch BERT model, RobBERT, was trained on the shuffled version of this corpus.
After deduplication, the size of this corpus was 32 GB.
The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000. The inputs of the model are pieces of 512 contiguous tokens that may span documents. The tokenizer itself was trained on Dutch texts. The beginning of a new document is marked with <s> and the end of one with </s>.
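To see the tokenization and the special tokens in action, here is a minimal sketch (the example sentence is only an illustration; the <s>/</s> handling shown is the standard RoBERTa convention in the Transformers library):

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')

# The tokenizer is case sensitive and keeps punctuation
encoded = tokenizer("Hoi, ik ben een model.")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# The printed token list starts with '<s>' and ends with '</s>',
# the boundary markers described above.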
The details of the masking procedure for each sentence are the following: 15% of the tokens are masked; in 80% of those cases the masked tokens are replaced by <mask>, in 10% of the cases they are replaced by a random token (different from the one they replace), and in the remaining 10% they are left as is. Contrary to BERT, the masking is done dynamically during pretraining (i.e., it changes at each epoch and is not fixed).
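This kind of dynamic masking is what the Transformers library's DataCollatorForLanguageModeling provides; the sketch below is only illustrative (the example sentence is made up and this is not the actual pretraining script):

from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')

# Masks are drawn anew every time a batch is built, so the masked
# positions change from epoch to epoch.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

batch = data_collator([tokenizer("Hoi ik ben een model.")])
print(batch['input_ids'])  # some positions replaced by the <mask> token id
print(batch['labels'])     # original ids at masked positions, -100 elsewhere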
The model was trained on 4 Titan RTX GPUs for 115K steps with a batch size of 1.3K and a sequence length of 512. The optimizer used is Adam with a learning rate of 5e-5 and a weight decay of 0.01, with learning rate warmup for 20,000 steps and linear decay of the learning rate afterwards.
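For reference, these hyperparameters map onto the Transformers Trainer roughly as follows (an illustrative sketch, not the actual pretraining setup; the per-device batch size and gradient accumulation split of the ~1.3K effective batch are assumptions):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='belabBERT_pretraining',  # hypothetical output path
    max_steps=115_000,
    per_device_train_batch_size=8,       # assumption
    gradient_accumulation_steps=40,      # assumption: 8 * 40 * 4 GPUs ≈ 1.3K sequences
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=20_000,
    lr_scheduler_type='linear',
)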
Due to credit limitations on the HPC, I was not able to finetune belabBERT on the common evaluation tasks.
However, belabBERT is likely to outperform the current state-of-the-art RobBERT, since belabBERT uses a Dutch tokenizer whereas RobBERT was trained with an English tokenizer.
On top of that, RobBERT was trained on a corpus shuffled at the line level, while belabBERT was trained on the unshuffled version of the same corpus, which makes belabBERT better equipped to deal with long sequences of text.
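The tokenizer argument can be made concrete by counting how many sub-word pieces each tokenizer needs for the same Dutch sentence. The sketch below uses roberta-base as a stand-in for an English byte-level BPE tokenizer (it is not RobBERT itself, and the example sentence is made up):

from transformers import RobertaTokenizer

dutch_tok = RobertaTokenizer.from_pretrained('jwouts/belabBERT_115k')
english_tok = RobertaTokenizer.from_pretrained('roberta-base')  # English stand-in

sentence = "Gisteren wandelde ik langs de grachten van Amsterdam."

# A tokenizer trained on Dutch text should split this sentence into fewer,
# more meaningful pieces than one trained on English text.
print(len(dutch_tok.tokenize(sentence)), len(english_tok.tokenize(sentence)))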
This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative.
Thanks to the builders of the OSCAR corpus for giving me permission to use the unshuffled Dutch version.
A major shout out to the brilliant @elslooo for the name of this model 🤗
Thanks to the model card of RoBERTa for the README format/text.