Problem Getting Word Predictions from the RoBERTa Model I Trained for Turkish
Hi everyone, I followed this guide with some differences, as mentioned here: https://github.com/pytorch/fairseq/issues/1186, to pretrain a RoBERTa model for Turkish. Below are the detailed steps I took:
1 - Downloaded a Turkish corpus and created train.txt, valid.txt and test.txt files from it.
2 - Trained a SentencePiece BPE model and used it to create train.bpe, valid.bpe and test.bpe files with the following command:
for SPLIT in train valid test; do \
cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done
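For reference, the SentencePiece model from step 2 can also be trained through the Python API. This is only a sketch; the paths, vocab size and character coverage here are assumptions, not the exact settings I used:
import sentencepiece as spm

# Train a BPE model on the raw Turkish training text.
# All paths and hyperparameters below are illustrative assumptions.
# (Older sentencepiece versions only accept the single-string form,
# e.g. spm.SentencePieceTrainer.Train('--input=... --model_prefix=...').)
spm.SentencePieceTrainer.train(
    input='wikitext-103-raw/wiki.train.raw',  # raw Turkish corpus
    model_prefix='turkish_bpe',               # writes turkish_bpe.model / turkish_bpe.vocab
    vocab_size=32000,                         # assumed vocabulary size
    model_type='bpe',
    character_coverage=1.0,                   # keep all Turkish characters
)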
3 - Ran the following command; since I did not specify --srcdict, fairseq-preprocess created a dict.txt for me:
fairseq-preprocess \
--only-source \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
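As a sanity check on this step: the dict.txt that fairseq-preprocess writes is a plain text file with one "symbol count" pair per line. A minimal sketch (assuming the destdir above) to confirm that the symbols are sentencepiece pieces rather than bare integers:
# Print the first lines of the generated dictionary. With
# --output_format=piece the symbols should look like sentencepiece
# pieces (e.g. "▁bir 12345"); with --output_format=id they are
# bare integers, which matters for the errors described below.
with open('data-bin/wikitext-103/dict.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 9:
            break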
4 - Followed the rest of the official guide for training.
The model training (with just a few epochs, to check that everything is correct) finished without errors. The problem is that when I try to get masked-word predictions, I get the following errors:
1 - When I create the bpe files with --output_format=piece:
for SPLIT in train valid test; do \
cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done
I use this code snippet to get predictions:
import torch
from fairseq.models.roberta import RobertaModel

# Load the trained checkpoint together with the preprocessed data
# directory, which holds dict.txt.
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
# "Hadi bir <mask> yiyelim." is Turkish for "Come on, let's eat a <mask>."
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)
The error is:
ValueError                                Traceback (most recent call last)
<ipython-input-13-867360f4059a> in <module>()
      3 roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
      4 assert isinstance(roberta.model, torch.nn.Module)
----> 5 roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

3 frames
/content/drive/My Drive/fairseq-master/fairseq/data/encoders/gpt2_bpe_utils.py in <listcomp>(.0)
    112
    113     def decode(self, tokens):
--> 114         text = ''.join([self.decoder[token] for token in tokens])
    115         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    116         return text

ValueError: invalid literal for int() with base 10: 'larıyla
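(Update, based on the replies below: the traceback shows that fill_mask decodes through fairseq's GPT-2 BPE helper (gpt2_bpe_utils.py), which is the default when no other --bpe is configured, so it tries to reinterpret sentencepiece pieces such as 'larıyla as GPT-2 tokens and the int() conversion fails. A sketch of the fix, assuming the trained sentencepiece model is copied into the model directory under the name sentencepiece.bpe.model, which recent fairseq versions pick up automatically; the exact argument name differs between versions:)
import torch
from fairseq.models.roberta import RobertaModel

# Load with the sentencepiece encoder instead of the default GPT-2 BPE,
# so fill_mask encodes/decodes with the same model used for preprocessing.
# Assumes 'checkpoints/sentencepiece.bpe.model' exists.
roberta = RobertaModel.from_pretrained(
    'checkpoints',
    'checkpoint_best.pt',
    'data-bin/wikitext-103',
    bpe='sentencepiece',
)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)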
2 - When I create the bpe files with --output_format=id:
for SPLIT in train valid test; do \
cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=id > wikitext-103-raw/wiki.${SPLIT}.bpe
done
The problem here is that I get English predictions even though I pretrained the model on a Turkish corpus. These are the predictions I get for a sample:
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)
loading archive file checkpoints
loading archive file data-bin/wikitext-103
| dictionary: 31544 types
[('Hadi bir revealing yiyelim.', 0.0002395595656707883, ' revealing'),
 ('Hadi birlington yiyelim.', 0.0002280160115333274, 'lington'),
 ('Hadi bir light yiyelim.', 0.0001991547178477049, ' light')]
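(Update: this also seems to be the GPT-2 BPE default at work. With --output_format=id every symbol in dict.txt is a bare integer string, and the default decoder apparently interprets those integers as indices into GPT-2's English vocabulary, which is why English fragments like ' revealing' and 'lington' come out of a Turkish model. A sketch that illustrates the mismatch; the encoder.json / vocab.bpe paths are assumptions, fairseq downloads these files the first time its GPT-2 BPE is used:)
# Decode a dictionary symbol the way the default BPE would.
# get_encoder is fairseq's GPT-2 BPE loader; 12345 stands in for a
# numeric symbol from dict.txt and maps to some unrelated English token.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

bpe = get_encoder('encoder.json', 'vocab.bpe')
print(bpe.decode([12345]))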
Any chance you could help? I am asking because I followed the issue I mentioned above, where you gave the replies, @lematt1991.
Top GitHub Comments
Thank you so much, now it works. This was really helpful, I appreciate your help. Closing the issue since it is now solved.
Do you mean the BPE model I trained to create the bpe files? It is not in the directory, but I will add it and try again.