
Suggestions on training on English to Vietnamese translation

See original GitHub issue

Hi, I am working on English-to-Vietnamese translation using the IWSLT data from Stanford NLP, which has 133K sentence pairs. I want to replicate the results presented here: https://github.com/tensorflow/tensor2tensor/pull/611, where the Transformer base architecture is used. I have a few quick questions:

  • Can data preparation be done the same way as prepare-iwslt14.sh, or is it better to run learn_bpe.py on a larger Wikipedia dump and then apply_bpe.py on the smaller corpus mentioned above? Are 133K sentences alone sufficient to learn BPE, so that I can just use the pre-processing script, or should I go for the latter approach? Also, since Moses does not support Vietnamese, how should I handle tokenization? (A sketch of this workflow follows the list.)

  • Whether to use --joined-dictionary during preprocessing. As I see here, a joined dictionary is preferable when the two languages share an alphabet. Vietnamese uses far more diacritics than German, where a joined dictionary is used without ambiguity, so I am unsure whether it applies.

  • Also, I used the command below on En-De and got good results with 4.5 million sentence pairs from WMT:

    CUDA_VISIBLE_DEVICES=0 python train.py data-bin/wmt16_en_de_bpe32k \
      --arch transformer_wmt_en_de --share-all-embeddings \
      --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
      --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
      --lr 0.0007 --min-lr 1e-09 \
      --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
      --weight-decay 0.0 --max-tokens 4096 --save-dir checkpoints/en-de \
      --update-freq 8 --no-progress-bar --log-format simple \
      --keep-interval-updates 20

    Since this dataset is small, can I use the same architecture, or should I use the IWSLT architecture used for De-En? In either case, do I need to change any hyperparameters? (A small-data sketch follows this question.)
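For the first two bullets above, here is a minimal sketch of how the BPE-plus-binarization workflow could look with subword-nmt's scripts and fairseq's preprocess.py. The file names, the 10000 merge operations, and the destination directory are placeholder assumptions, and the sketch assumes the raw text has already been tokenized by some means:

# Learn a single BPE model on the concatenation of both sides of the
# training data (placeholder: 10000 merge operations; tune this value)
cat train.en train.vi > train.both
python learn_bpe.py -s 10000 < train.both > bpe.codes

# Apply the learned codes to every split of both languages
for lang in en vi; do
  for split in train valid test; do
    python apply_bpe.py -c bpe.codes < $split.$lang > $split.bpe.$lang
  done
done

# Binarize with fairseq, building one shared (joined) vocabulary
# (data-bin/iwslt15_en_vi is a placeholder path)
python preprocess.py --source-lang en --target-lang vi \
  --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
  --joined-dictionary --destdir data-bin/iwslt15_en_vi

Learning the codes jointly and passing --joined-dictionary go together: both assume source and target share one subword vocabulary.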

I know these questions are naive, but I think the answers would also help other new users working on Vietnamese. I am on a single-GPU setup. Please give some suggestions or leads for this kind of task and dataset. Thanks.
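For the third bullet, a hedged sketch of how the same command might be adapted to a small dataset by switching to the IWSLT preset. The data-bin path matches the preprocessing sketch above; the dropout, weight-decay, and learning-rate values are illustrative guesses, not tuned settings:

# Placeholder values throughout; only the flag names are fairseq's
CUDA_VISIBLE_DEVICES=0 python train.py data-bin/iwslt15_en_vi \
  --arch transformer_iwslt_de_en --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0005 --min-lr 1e-09 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --dropout 0.3 --weight-decay 0.0001 \
  --max-tokens 4096 --save-dir checkpoints/en-vi \
  --no-progress-bar --log-format simple

On a single GPU, --update-freq can be raised to simulate a larger effective batch, as in the WMT command above.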

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
edunov commented, Jan 17, 2019

Hi @Raghava14

Unfortunately I can’t give exact answers to your questions without proper experimentation, but here are some thoughts:

  • Is it better to learn BPE on 133K parallel sentences or on a larger monolingual corpus? I tried both for another language pair (160K parallel sentences) and didn't see much difference; changing the BPE vocabulary size generally has a much bigger effect.

  • Whether to use --joined-dictionary: for low-resource settings a joined dictionary seems to be beneficial even if the vocabularies are not shared (assuming you also use --share-all-embeddings during training). If you don't share the vocabulary, make sure you use --share-decoder-input-output-embed during training.

  • What architecture to pick: the pre-defined architectures we have, like transformer_wmt_en_de or transformer_iwslt_de_en, are tuned for particular datasets; that is, we found that these architectures work best on the datasets in question. They tend to work well on other datasets of similar size, but if you want to squeeze out every bit of performance, you need to tune your own architecture. That, unfortunately, requires a lot of experimentation. You'll need to try various values for these parameters: https://github.com/pytorch/fairseq/blob/master/fairseq/models/transformer.py#L872-L879 and see what works best. (A sketch of overriding them from the command line follows.)
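For illustration, a minimal sketch of overriding those parameters directly from the command line rather than defining a new named architecture. The sizes below are placeholders for a smaller model, not recommendations; the flag names correspond to the transformer parameters linked above:

# Architecture overrides on top of the generic transformer preset
# (all sizes here are illustrative placeholders to sweep over)
python train.py data-bin/iwslt15_en_vi --arch transformer \
  --encoder-embed-dim 256 --encoder-ffn-embed-dim 1024 \
  --encoder-layers 4 --encoder-attention-heads 4 \
  --decoder-embed-dim 256 --decoder-ffn-embed-dim 1024 \
  --decoder-layers 4 --decoder-attention-heads 4 \
  --share-all-embeddings --dropout 0.3 \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --warmup-updates 4000 --lr 0.0005 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096

Sweeping the embedding dimension, FFN dimension, layer count, and head count this way is the kind of experimentation meant above.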

Hope that helps…

1 reaction
jiachangliu commented, Oct 10, 2019

@Raghava14 Thank you very much. I used separate dictionaries. I will try --joined-dictionary to see if I can get better results.

