
Translation output consists of unknown tokens.

See original GitHub issue

Hi, I trained an MT model using the Sockeye Docker image, but the translation output consists only of unknown tokens: “<unk> <unk> <unk> <unk>”.

I also noticed in the data-preparation log that the number of <unk> tokens in both source and target is very high. I tried different values for the other parameters, but the numbers stayed the same.

[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Tokens: source 207316620 target 210246432
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Number of <unk> tokens: source 188976085 target 192506410
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Vocabulary coverage: source 9% target 8%
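For reference, the coverage figure in the log is simply the share of corpus tokens that appear in the model vocabulary. A minimal, hypothetical sketch of that computation (illustrative only, not Sockeye's actual code):

```python
from collections import Counter

def vocab_coverage(corpus_tokens, vocab):
    """Return (known, unk, coverage) for a token list against a vocabulary."""
    counts = Counter(corpus_tokens)
    known = sum(n for tok, n in counts.items() if tok in vocab)
    total = sum(counts.values())
    return known, total - known, (known / total if total else 0.0)

# Toy example: a vocabulary that misses most corpus tokens,
# mirroring the very low coverage reported in the log above.
corpus = "the cat sat on the mat".split() * 3   # 18 tokens
vocab = {"the"}                                  # severely mismatched vocabulary
known, unk, cov = vocab_coverage(corpus, vocab)  # cov == 1/3
```

A coverage of 9% means roughly nine out of ten training tokens are replaced by <unk> before the model ever sees them, which is enough on its own to explain all-<unk> output.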

Commands:

python -m learn-bpe -s 32000 < corpus.ar > bpe.codes.ar
python -m learn-bpe -s 32000 < corpus.en > bpe.codes.en

python -m apply-bpe -c bpe.codes.ar < corpus.ar > corpus.ar.bpe
python -m apply-bpe -c bpe.codes.en < corpus.en > corpus.en.bpe

python -m sockeye.prepare_data -s corpus.en.bpe -t corpus.ar.bpe -o prepared_data --num-words 32000

python -m sockeye.train -d train_data -vs data_dev_lc.BPE.en -vt data_dev_lc.BPE.ar -o model_3 --num-layers 6 --transformer-model-size 512 --transformer-attention-heads 8 --transformer-feed-forward-num-hidden 2048 --optimizer adam --batch-size 2000 --update-interval 2 --initial-learning-rate 1e-5 --learning-rate-reduce-factor 0.5 --learning-rate-reduce-num-not-improved 2 --max-num-checkpoint-not-improved 60 --checkpoint-interval 4000 --decode-and-evaluate 500 --label-smoothing 0.1 --seed 1 --device-ids 0 --weight-tying-type none --batch-type word

python -m sockeye.translate -i tarjama_test_lc.en.bpe -o tarjama_test_lc.hyp.bpe -m model_2 --beam-size 5 --batch-size 64 --device-ids 0
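One thing worth checking with commands like these is that the BPE-encoded training files and the vocabulary built during data preparation actually correspond (note, for instance, that prepare_data writes to prepared_data while train reads train_data, and train outputs model_3 while translate loads model_2). A quick, hypothetical pre-flight check in plain Python: build a frequency-based vocabulary from the BPE'd file and confirm near-total coverage before launching training:

```python
from collections import Counter

def build_vocab(lines, num_words):
    # Keep the num_words most frequent whitespace-separated tokens,
    # roughly analogous to what --num-words 32000 does at data preparation.
    counts = Counter(tok for line in lines for tok in line.split())
    return {tok for tok, _ in counts.most_common(num_words)}

def unk_rate(lines, vocab):
    toks = [tok for line in lines for tok in line.split()]
    return sum(1 for t in toks if t not in vocab) / len(toks)

# Toy stand-in for corpus.en.bpe ("@@" marks BPE subword joiners).
train = ["the ca@@ t sat on the ma@@ t", "the ca@@ t slept"]
vocab = build_vocab(train, num_words=32000)
rate = unk_rate(train, vocab)  # ~0.0 when vocab and corpus match
```

On the very data the vocabulary was built from, this rate should be close to zero; a value like 0.91 (as the log above implies) signals that the vocabulary and the corpus do not match.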

What is the reason for the garbage translation output, and how can we reduce the number of <unk> tokens when preparing the data? How can I solve this?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

NourKhdour commented, Sep 30, 2021 (1 reaction)

Thanks, the problem with prepare_data was solved by the new release. But when I start training the model, I receive the error below:

[INFO:sockeye.training] Training started.
Floating point exception (core dumped)
fhieber commented, Sep 30, 2021 (1 reaction)

I created a new release (2.3.22) where this bug should be fixed: https://github.com/awslabs/sockeye/releases/tag/2.3.22

Please let us know if the problem is solved then. Thanks!


Top Results From Across the Web

Machine translation transformer output - "unknown" tokens?
unk means the token is not present in the vocabulary. You need to use BPE or SentencePiece model to address this problem. –...
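For context, the replacement this answer describes can be sketched as follows (a hypothetical helper, not taken from any particular toolkit):

```python
def encode(tokens, vocab, unk="<unk>"):
    # Any token absent from the vocabulary is mapped to the <unk> symbol;
    # with only ~9% coverage, nearly every token ends up as <unk>.
    return [tok if tok in vocab else unk for tok in tokens]

encoded = encode("the cat sat".split(), {"the", "cat"})
# encoded == ['the', 'cat', '<unk>']
```

Subword methods such as BPE or SentencePiece avoid this by splitting rare words into smaller, in-vocabulary units.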
nlp - Machine translation transformer output - "unknown" tokens?
This is how I tokenized my data, I am using German to english for the translation task. from transformers import BertTokenizer bert_tokenizer_en ...
Is it a good idea to apply NER for translation #1106 - GitHub
I think trying to convert dates into one token and translate it ... Therefore, the translations never have unknown token (e.g., ) produced....
Addressing the Rare Word Problem in Neural Machine ...
This annotation enables us to translate every non-null unknown token. 3.2 Positional All Model (PosAll). The copyable model is limited by its inability...
Translation of Unknown Words in Low Resource Languages
to copy the unknown word in the translated output, which may work for named ... includes OOV words whose morphological variants were seen...
