
Translation output consists of unknown tokens.

See original GitHub issue

Hi, I trained an MT model using the Sockeye Docker image, but the translation output consists only of unknown tokens: “<unk> <unk> <unk> <unk>”.

I also noticed in the data-preparation log that the number of <unk> tokens in both source and target is very high. I tried different values for the other parameters, but the numbers stayed the same.

[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Tokens: source 207316620 target 210246432
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Number of <unk> tokens: source 188976085 target 192506410
[2021-09-26:12:51:05:INFO:sockeye.data_io:log] Vocabulary coverage: source 9% target 8%
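For reference, the coverage figure in the log is simply the share of corpus tokens that appear in the model vocabulary. A minimal, hypothetical sketch of that computation (illustrative only, not Sockeye's actual code):

```python
from collections import Counter

def vocab_coverage(corpus_tokens, vocab):
    """Return (known, unk, coverage) for a token list against a vocabulary."""
    counts = Counter(corpus_tokens)
    known = sum(n for tok, n in counts.items() if tok in vocab)
    total = sum(counts.values())
    return known, total - known, (known / total if total else 0.0)

# Toy example: a vocabulary that misses most corpus tokens,
# mirroring the very low coverage reported in the log above.
corpus = "the cat sat on the mat".split() * 3   # 18 tokens
vocab = {"the"}                                  # severely mismatched vocabulary
known, unk, cov = vocab_coverage(corpus, vocab)  # cov == 1/3
```

A coverage of 9% means roughly nine out of ten training tokens are replaced by <unk> before the model ever sees them, which is enough on its own to explain all-<unk> output.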

Commands:

python -m learn-bpe -s 32000 < corpus.ar > bpe.codes.ar
python -m learn-bpe -s 32000 < corpus.en > bpe.codes.en

python -m apply-bpe -c bpe.codes.ar < corpus.ar > corpus.ar.bpe
python -m apply-bpe -c bpe.codes.en < corpus.en > corpus.en.bpe

python -m sockeye.prepare_data -s corpus.en.bpe -t corpus.ar.bpe -o prepared_data --num-words 32000

python -m sockeye.train -d train_data -vs data_dev_lc.BPE.en -vt data_dev_lc.BPE.ar -o model_3 --num-layers 6 --transformer-model-size 512 --transformer-attention-heads 8 --transformer-feed-forward-num-hidden 2048 --optimizer adam --batch-size 2000 --update-interval 2 --initial-learning-rate 1e-5 --learning-rate-reduce-factor 0.5 --learning-rate-reduce-num-not-improved 2 --max-num-checkpoint-not-improved 60 --checkpoint-interval 4000 --decode-and-evaluate 500 --label-smoothing 0.1 --seed 1 --device-ids 0 --weight-tying-type none --batch-type word

python -m sockeye.translate -i tarjama_test_lc.en.bpe -o tarjama_test_lc.hyp.bpe -m model_2 --beam-size 5 --batch-size 64 --device-ids 0
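One thing worth checking with commands like these is that the BPE-encoded training files and the vocabulary built during data preparation actually correspond (note, for instance, that prepare_data writes to prepared_data while train reads train_data, and train outputs model_3 while translate loads model_2). A quick, hypothetical pre-flight check in plain Python: build a frequency-based vocabulary from the BPE'd file and confirm near-total coverage before launching training:

```python
from collections import Counter

def build_vocab(lines, num_words):
    # Keep the num_words most frequent whitespace-separated tokens,
    # roughly analogous to what --num-words 32000 does at data preparation.
    counts = Counter(tok for line in lines for tok in line.split())
    return {tok for tok, _ in counts.most_common(num_words)}

def unk_rate(lines, vocab):
    toks = [tok for line in lines for tok in line.split()]
    return sum(1 for t in toks if t not in vocab) / len(toks)

# Toy stand-in for corpus.en.bpe ("@@" marks BPE subword joiners).
train = ["the ca@@ t sat on the ma@@ t", "the ca@@ t slept"]
vocab = build_vocab(train, num_words=32000)
rate = unk_rate(train, vocab)  # ~0.0 when vocab and corpus match
```

On the very data the vocabulary was built from, this rate should be close to zero; a value like 0.91 (as the log above implies) signals that the vocabulary and the corpus do not match.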

What is the reason for the garbage translation output, and how can we reduce the number of <unk> tokens when preparing the data? How can I solve this?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

NourKhdour commented, Sep 30, 2021 (1 reaction)

Thanks, the problem with prepare_data was solved by the new release. But when I start training the model, I receive the error below:

[INFO:sockeye.training] Training started.
Floating point exception (core dumped)
fhieber commented, Sep 30, 2021 (1 reaction)

I created a new release (2.3.22) where this bug should be fixed: https://github.com/awslabs/sockeye/releases/tag/2.3.22

Please let us know if the problem is solved then. Thanks!


Top Results From Across the Web

Machine translation transformer output - "unknown" tokens?
unk means the token is not present in the vocabulary. You need to use BPE or SentencePiece model to address this problem. –...
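For context, the replacement this answer describes can be sketched as follows (a hypothetical helper, not taken from any particular toolkit):

```python
def encode(tokens, vocab, unk="<unk>"):
    # Any token absent from the vocabulary is mapped to the <unk> symbol;
    # with only ~9% coverage, nearly every token ends up as <unk>.
    return [tok if tok in vocab else unk for tok in tokens]

encoded = encode("the cat sat".split(), {"the", "cat"})
# encoded == ['the', 'cat', '<unk>']
```

Subword methods such as BPE or SentencePiece avoid this by splitting rare words into smaller, in-vocabulary units.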
nlp - Machine translation transformer output - "unknown" tokens?
This is how I tokenized my data, I am using German to english for the translation task. from transformers import BertTokenizer bert_tokenizer_en ...
Is it a good idea to apply NER for translation #1106 - GitHub
I think trying to convert dates into one token and translate it ... Therefore, the translations never have unknown token (e.g., ) produced....
Addressing the Rare Word Problem in Neural Machine ...
This annotation enables us to translate every non-null unknown token. 3.2 Positional All Model (PosAll). The copyable model is limited by its inability...
Translation of Unknown Words in Low Resource Languages
to copy the unknown word in the translated output, which may work for named ... includes OOV words whose morphological variants were seen...
