Translation output consists of unknown tokens.
Hi, I trained an MT model using the Sockeye Docker image, but the translation output consisted only of unknown tokens: "<unk> <unk> <unk> <unk>".
I also noticed in the data-preparation log that the number of <unk> tokens in both source and target is very high. I tried different values for the other parameters, but the numbers stayed the same.
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Tokens: source 207316620 target 210246432
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Number of <unk> tokens: source 188976085 target 192506410
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Vocabulary coverage: source 9% target 8%
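A quick way to double-check this coverage number outside of Sockeye is to count how many corpus tokens fall outside the vocabulary. A minimal shell sketch, assuming whitespace-tokenized input and a hypothetical vocab.txt with one token per line (this is not the exact computation Sockeye performs):

# Count each distinct token in the BPE-segmented corpus.
tr ' ' '\n' < corpus.en.bpe | sort | uniq -c > token_counts.txt
# Sum the counts of tokens missing from vocab.txt and report the OOV rate.
awk 'NR==FNR { vocab[$1] = 1; next } { total += $1; if (!($2 in vocab)) oov += $1 } END { printf "OOV rate: %.1f%%\n", 100 * oov / total }' vocab.txt token_counts.txt

After BPE segmentation the OOV rate should be close to zero, so roughly 91% <unk> tokens points to a mismatch between the corpus and the vocabulary rather than to genuinely rare words.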
Commands:
python -m subword_nmt.learn_bpe -s 32000 < corpus.ar > bpe.codes.ar
python -m subword_nmt.learn_bpe -s 32000 < corpus.en > bpe.codes.en
python -m subword_nmt.apply_bpe -c bpe.codes.ar < corpus.ar > corpus.ar.bpe
python -m subword_nmt.apply_bpe -c bpe.codes.en < corpus.en > corpus.en.bpe
python -m sockeye.prepare_data -s corpus.en.bpe -t corpus.ar.bpe -o prepared_data --num-words 32000
python -m sockeye.train -d train_data -vs data_dev_lc.BPE.en -vt data_dev_lc.BPE.ar -o model_3 --num-layers 6 --transformer-model-size 512 --transformer-attention-heads 8 --transformer-feed-forward-num-hidden 2048 --optimizer adam --batch-size 2000 --update-interval 2 --initial-learning-rate 1e-5 --learning-rate-reduce-factor 0.5 --learning-rate-reduce-num-not-improved 2 --max-num-checkpoint-not-improved 60 --checkpoint-interval 4000 --decode-and-evaluate 500 --label-smoothing 0.1 --seed 1 --device-ids 0 --weight-tying-type none --batch-type word
python -m sockeye.translate -i tarjama_test_lc.en.bpe -o tarjama_test_lc.hyp.bpe -m model_2 --beam-size 5 --batch-size 64 --device-ids 0
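Two quick sanity checks on the pipeline above, independent of the Sockeye bug discussed below: verify that apply_bpe actually segmented the text (subword-nmt marks split points with "@@"), and that the vocabulary used by prepare_data matches the BPE-processed corpus. A sketch using the file names above:

# Eyeball a couple of lines: long words should be split into subwords joined by "@@".
head -n 2 corpus.en.bpe corpus.ar.bpe
# Count lines containing at least one BPE split point; this should be most of the corpus.
grep -c '@@' corpus.en.bpe

Note also that 32000 merge operations usually yield somewhat more than 32000 distinct symbols (merges plus single characters), so --num-words 32000 may clip a few rare subwords into <unk>; that alone, however, cannot explain a 91% <unk> rate.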
What is the reason for the garbage translation output, and how can we reduce the number of <unk> tokens when preparing the data? How can I solve this?

I created a new release (2.3.22) where this bug should be fixed: https://github.com/awslabs/sockeye/releases/tag/2.3.22
Please let us know if the problem is solved then. Thanks!

Thanks, the problem with prepare_data was solved by the new release. But when I start training the model, I receive the error below:
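For anyone hitting the same symptom: the fix shipped in Sockeye 2.3.22, so upgrading the package and re-running data preparation should restore sensible vocabulary coverage. A sketch assuming a pip-based install (e.g. inside the Docker image):

# Upgrade Sockeye to the release containing the fix, then redo data preparation.
pip install --upgrade sockeye==2.3.22
python -m sockeye.prepare_data -s corpus.en.bpe -t corpus.ar.bpe -o prepared_data --num-words 32000

The logged "Vocabulary coverage" should then be close to 100% rather than single digits.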