Decoding problem for char-based translation
Hi,
I modified the wmt_ende_characters problem
to translate Macedonian to English (BLEU score after training was 0.526888).
The input sentence is:
Kosovskiot proces na privatizaciјa se ispituva
Then the t2t_trainer
command shows some weird output:
INFO:tensorflow:Restoring parameters from t2t_train/model.ckpt-250000
INFO:tensorflow:Inference results INPUT: Mquqxumkqv"rtqegu"pc"rtkxcvk|cekӚc"ug"kurkvwxc
INFO:tensorflow:Inference results OUTPUT: Mukwak.cwave.gurk.fe.ce.sce.gurkwe.ce.ce
INFO:tensorflow:Writing decodes into test.txt.transformer.transformer_base.beam4.alpha0.6.decodes
Tested with version 1.0.5 and 1.0.7. Is this a bug?
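For what it's worth, the garbled INPUT line looks like every character shifted up by exactly two code points, which would be consistent with an encode/decode mismatch around reserved token IDs. A quick sanity check in plain Python (not t2t code; the non-ASCII character may be additionally mangled by copy-paste, so only the ASCII part is checked):

```python
# The "weird" INPUT string from the decode log.
garbled = 'Mquqxumkqv"rtqegu"pc"rtkxcvk|cekӚc"ug"kurkvwxc'

# Shifting every character down by 2 code points recovers the original
# ASCII text: 'M'->'K', 'q'->'o', ..., and '"' (34) -> ' ' (32).
recovered = "".join(chr(ord(c) - 2) for c in garbled)
print(recovered)  # starts with: Kosovskiot proces na privatizaci...
```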
Issue Analytics
- State:
- Created 6 years ago
- Comments:10 (9 by maintainers)
That would be wonderful, yes, we welcome a PR! And great thanks for all the python3 work too 😃.
It is a bit strange that the character-based generators in
wmt.py
do not use text_encoder.ByteTextEncoder()
for encoding the source and target strings to vectors, but simply do a raw conversion of character ordinal values. I am working on a PR that fixes this, and at least in preliminary testing the output looks much saner.