
I trained with translate_enzh_wmt32k, and BLEU is only 1.25. What's the reason? I'd appreciate you telling me

See original GitHub issue

Description

I trained a model with translate_enzh_wmt32k, but the BLEU score is only 1.25. What's the reason? I'd appreciate it if you could tell me, thank you!

Environment information

OS: Linux

Steps to reproduce:

Just follow the commands from the walkthrough in the docs.

Something like this (including the data-generation step from the walkthrough, which uses $TMP_DIR):
$ PROBLEM=translate_enzh_wmt32k
$ MODEL=transformer
$ HPARAMS=transformer_base_single_gpu
$ DATA_DIR=$HOME/t2t_data
$ TMP_DIR=/tmp/t2t_datagen
$ TRAIN_DIR=$HOME/t2t_train/$PROBLEM/$MODEL-$HPARAMS
$ T2T_USR_DIR=$HOME/t2t_usr_dir

$ t2t-datagen \
  --data_dir=$DATA_DIR \
  --tmp_dir=$TMP_DIR \
  --problem=$PROBLEM

$ t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=$PROBLEM \
  --model=$MODEL \
  --hparams_set=$HPARAMS \
  --output_dir=$TRAIN_DIR
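One way to see where the score collapses is to inspect how the generated subword vocabulary splits Chinese text. A minimal sketch, assuming tensor2tensor's Python API; the vocabulary filename below is an assumption, so substitute whatever t2t-datagen actually wrote into $DATA_DIR:

# Sketch: check what subword pieces the generated vocab produces for Chinese.
# The vocab filename is an assumption; use the file t2t-datagen created.
import os
from tensor2tensor.data_generators import text_encoder

data_dir = os.path.expanduser("~/t2t_data")
vocab_path = os.path.join(data_dir, "vocab.translate_enzh_wmt32k.32768.subwords")

encoder = text_encoder.SubwordTextEncoder(vocab_path)
ids = encoder.encode("今天天气很好")
print(ids)
print(encoder.decode_list(ids))  # the actual subword pieces, one per id

If the pieces come out as single bytes or as long chunks that rarely recur, the subword vocabulary is a poor fit for the corpus, which would be consistent with the very low BLEU reported here.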

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 16 (3 by maintainers)

Top GitHub Comments

4 reactions
yynil commented, Nov 3, 2018

The solution is pretty simple and straightforward.

  1. Extract all the unique characters in the corpus, including symbols.
  2. Use those characters as the dictionary, and just use the text token encoder (a sketch of these two steps follows below).
  3. Train the model using the transformer_base configuration.
  4. It took us 1.5 million steps to get a BLEU score of 22~23.

We believe that for a character-based language like Chinese or Japanese, word segmentation is no longer necessary, because the deep neural network will learn to join characters into words better than any word-segmentation library. The translation model suggests this guess might be right.
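A minimal sketch of steps 1 and 2, assuming tensor2tensor's text_encoder module (the thread shares no code, and "corpus.zh" / "vocab.char.zh" are placeholder file names):

# Build a character-level vocabulary and use it with t2t's TokenTextEncoder.
# TokenTextEncoder treats each whitespace-separated token as one vocabulary
# entry, so sentences must be pre-split into characters before encoding.
from tensor2tensor.data_generators import text_encoder

# Step 1: collect every unique non-whitespace character, symbols included.
chars = set()
with open("corpus.zh", encoding="utf-8") as f:
    for line in f:
        chars.update(ch for ch in line if not ch.isspace())

# Step 2: write the characters out as the vocabulary file, one per line.
with open("vocab.char.zh", "w", encoding="utf-8") as f:
    for ch in sorted(chars):
        f.write(ch + "\n")

encoder = text_encoder.TokenTextEncoder("vocab.char.zh")
ids = encoder.encode(" ".join("今天天气很好"))  # pre-split into characters
print(ids)

Wiring this into training would additionally need a custom Problem subclass in $T2T_USR_DIR whose feature_encoders method returns this encoder; that part is omitted here.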

On Nov 3, 2018, at 20:34, ConnectDotz (notifications@github.com) wrote:

@yynil Thanks. Did you just use a simple one-hot encoding, then? Is there an example (or a parameter?) showing how to plug in our own tokenizer? Would you care to share more detail? Actually, if the current implementation cannot reliably produce a reasonable result(?), would you consider contributing your solution?


1 reaction
hpulfc commented, Sep 11, 2018

Finally, you should apply word segmentation to the Chinese text first; then it will be suitable for the SubTokenEncoder.
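A minimal sketch of that pre-segmentation step, using the third-party jieba segmenter (the comment names no library, so this is just one possible choice; file names are placeholders):

# Segment Chinese text into space-separated words before feeding it to the
# subword encoder. Requires: pip install jieba. File names are placeholders.
import jieba

with open("corpus.zh", encoding="utf-8") as fin, \
     open("corpus.seg.zh", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(jieba.cut(line.strip())) + "\n")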

Read more comments on GitHub >

