Problem Getting Word Predictions from the RoBERTa Model I Trained for Turkish
Hi everyone, I followed this guide with some differences, as mentioned here: https://github.com/pytorch/fairseq/issues/1186, to pretrain a RoBERTa model for Turkish. Below are the detailed steps I took:
1 - Downloaded a Turkish corpus and created train.txt, valid.txt and test.txt files from it.
2 - Trained a SentencePiece BPE model and used it to create train.bpe, valid.bpe and test.bpe files with the following command:
for SPLIT in train valid test; do \
cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done
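For reference, the SentencePiece model from step 2 can also be trained through the Python API. This is only a sketch; the paths, vocab size and character coverage here are assumptions, not the exact settings I used:
import sentencepiece as spm

# Train a BPE model on the raw Turkish training text.
# All paths and hyperparameters below are illustrative assumptions.
# (Older sentencepiece versions only accept the single-string form,
# e.g. spm.SentencePieceTrainer.Train('--input=... --model_prefix=...').)
spm.SentencePieceTrainer.train(
    input='wikitext-103-raw/wiki.train.raw',  # raw Turkish corpus
    model_prefix='turkish_bpe',               # writes turkish_bpe.model / turkish_bpe.vocab
    vocab_size=32000,                         # assumed vocabulary size
    model_type='bpe',
    character_coverage=1.0,                   # keep all Turkish characters
)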
3 - Ran the following command; since I did not specify --srcdict, fairseq-preprocess created a dict.txt for me:
fairseq-preprocess \
--only-source \
--trainpref wikitext-103-raw/wiki.train.bpe \
--validpref wikitext-103-raw/wiki.valid.bpe \
--testpref wikitext-103-raw/wiki.test.bpe \
--destdir data-bin/wikitext-103 \
--workers 60
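As a sanity check on this step: the dict.txt that fairseq-preprocess writes is a plain text file with one "symbol count" pair per line. A minimal sketch (assuming the destdir above) to confirm that the symbols are sentencepiece pieces rather than bare integers:
# Print the first lines of the generated dictionary. With
# --output_format=piece the symbols should look like sentencepiece
# pieces (e.g. "▁bir 12345"); with --output_format=id they are
# bare integers, which matters for the errors described below.
with open('data-bin/wikitext-103/dict.txt', encoding='utf-8') as f:
    for i, line in enumerate(f):
        print(line.rstrip())
        if i >= 9:
            break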
4 - Followed the rest of the official guide for training.
The model training (with just a few epochs, to check that everything is correct) finished without errors. The problem is that when I try to get masked-word predictions, I get the following errors:
1 - When I create the bpe files with --output_format=piece:
for SPLIT in train valid test; do \
cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=piece > wikitext-103-raw/wiki.${SPLIT}.bpe
done
I use this code snippet to get predictions:
import torch
from fairseq.models.roberta import RobertaModel

# Load the trained checkpoint together with the preprocessed data
# directory, which holds dict.txt.
roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
# "Hadi bir <mask> yiyelim." is Turkish for "Come on, let's eat a <mask>."
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)
The error is:
ValueError                                Traceback (most recent call last)
<ipython-input-13-867360f4059a> in <module>()
      3 roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
      4 assert isinstance(roberta.model, torch.nn.Module)
----> 5 roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)

3 frames
/content/drive/My Drive/fairseq-master/fairseq/data/encoders/gpt2_bpe_utils.py in <listcomp>(.0)
    112
    113     def decode(self, tokens):
--> 114         text = ''.join([self.decoder[token] for token in tokens])
    115         text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors=self.errors)
    116         return text

ValueError: invalid literal for int() with base 10: 'larıyla
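(Update, based on the replies below: the traceback shows that fill_mask decodes through fairseq's GPT-2 BPE helper (gpt2_bpe_utils.py), which is the default when no other --bpe is configured, so it tries to reinterpret sentencepiece pieces such as 'larıyla as GPT-2 tokens and the int() conversion fails. A sketch of the fix, assuming the trained sentencepiece model is copied into the model directory under the name sentencepiece.bpe.model, which recent fairseq versions pick up automatically; the exact argument name differs between versions:)
import torch
from fairseq.models.roberta import RobertaModel

# Load with the sentencepiece encoder instead of the default GPT-2 BPE,
# so fill_mask encodes/decodes with the same model used for preprocessing.
# Assumes 'checkpoints/sentencepiece.bpe.model' exists.
roberta = RobertaModel.from_pretrained(
    'checkpoints',
    'checkpoint_best.pt',
    'data-bin/wikitext-103',
    bpe='sentencepiece',
)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)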
2 - When I create the bpe files with --output_format=id:
for SPLIT in train valid test; do \
cat wikitext-103-raw/wiki.${SPLIT}.raw | spm_encode --model=<model_file> --output_format=id > wikitext-103-raw/wiki.${SPLIT}.bpe
done
The problem here is that I get English predictions even though I pretrained the model on a Turkish corpus. These are the predictions I get for a sample:
import torch
from fairseq.models.roberta import RobertaModel

roberta = RobertaModel.from_pretrained('checkpoints', 'checkpoint_best.pt', 'data-bin/wikitext-103')
assert isinstance(roberta.model, torch.nn.Module)
roberta.fill_mask('Hadi bir <mask> yiyelim.', topk=3)
loading archive file checkpoints
loading archive file data-bin/wikitext-103
| dictionary: 31544 types
[('Hadi bir revealing yiyelim.', 0.0002395595656707883, ' revealing'),
 ('Hadi birlington yiyelim.', 0.0002280160115333274, 'lington'),
 ('Hadi bir light yiyelim.', 0.0001991547178477049, ' light')]
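(Update: this also seems to be the GPT-2 BPE default at work. With --output_format=id every symbol in dict.txt is a bare integer string, and the default decoder apparently interprets those integers as indices into GPT-2's English vocabulary, which is why English fragments like ' revealing' and 'lington' come out of a Turkish model. A sketch that illustrates the mismatch; the encoder.json / vocab.bpe paths are assumptions, fairseq downloads these files the first time its GPT-2 BPE is used:)
# Decode a dictionary symbol the way the default BPE would.
# get_encoder is fairseq's GPT-2 BPE loader; 12345 stands in for a
# numeric symbol from dict.txt and maps to some unrelated English token.
from fairseq.data.encoders.gpt2_bpe_utils import get_encoder

bpe = get_encoder('encoder.json', 'vocab.bpe')
print(bpe.decode([12345]))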
Any chance you could help? I am asking because I followed the issue I mentioned above, where you gave the replies, @lematt1991.
Top GitHub Comments
Thank you so much, now it works. This was really helpful, I appreciate your help. Closing the issue since it is now solved.
Do you mean the BPE model I trained to create the bpe files? It is not in the directory, but I will add it and try again.