
Parallel datasets


Hi, I am trying to create a POC using CodeGen to translate code between VB and Java. I downloaded the training data for VB and Java using Google BigQuery, and I have completed the preprocessing step using these commands:

  1. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual_functions --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10
  2. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

As a result, the following files were created inside the folder XLM-syml:

  1. test.[java_cl|java_monolingual|java_sa|vb_cl|vb_monolingual|vb_sa].pth
  2. train.[java_cl|java_monolingual|java_sa|vb_cl|vb_monolingual|vb_sa].[0-9].pth
  3. valid.[java_cl|java_monolingual|java_sa|vb_cl|vb_monolingual|vb_sa].pth
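
As a quick sanity check, the expected outputs can be enumerated and compared against the folder contents. A minimal sketch, assuming the suffix and split-index naming shown in the list above (the exact convention should be confirmed against the preprocessing code):

```python
# Sketch (not from the repo): enumerate the .pth files a monolingual_functions
# preprocessing run with --train_splits=10 is expected to produce, so missing
# outputs can be spotted before training. Naming is an assumption based on the
# file list above.
from itertools import product

def expected_pth_files(langs, n_train_splits=10):
    suffixes = ("cl", "monolingual", "sa")
    files = []
    for lang, suf in product(langs, suffixes):
        files.append(f"test.{lang}_{suf}.pth")
        files.append(f"valid.{lang}_{suf}.pth")
        files.extend(f"train.{lang}_{suf}.{i}.pth" for i in range(n_train_splits))
    return sorted(files)

names = expected_pth_files(["vb", "java"])
print(len(names))  # 2 languages x 3 suffixes x (test + valid + 10 train splits)
```

Diffing this list against `os.listdir("XLM-syml")` shows at a glance which splits failed to materialize.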

After that, I trained the MLM model using the following command:

  python codegen_sources/model/train.py \
    --exp_name mlm_vb_java_fast_mono_updated_v0 \
    --dump_path '/content/Facebook_CodeGen/dumpPath_fast_mono_updated' \
    --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' \
    --mlm_steps 'vb_sa,java_sa' \
    --add_eof_to_stream true \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15' \
    --encoder_only true \
    --n_layers 6 \
    --emb_dim 1024 \
    --n_heads 8 \
    --lgs 'vb_sa-java_sa' \
    --max_vocab 64000 \
    --gelu_activation false \
    --roberta_mode false \
    --amp 2 \
    --fp16 true \
    --batch_size 16 \
    --bptt 512 \
    --epoch_size 200 \
    --max_epoch 100000 \
    --split_data_accross_gpu global \
    --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
    --save_periodic 0 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion '_valid_mlm_ppl,10'

However, when I try to train the TransCoder model with the following command, I get AssertionError: /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml/valid.java_sa-vb_sa.java_sa.0.pth. Command:

  python codegen_sources/model/train.py \
    --exp_name transcoder_vb_java_updated_v1 \
    --dump_path '/content/drive/MyDrive/dumpPath_updated_transcoder_v0' \
    --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' \
    --split_data_accross_gpu local \
    --bt_steps 'vb_sa-java_sa-vb_sa,java_sa-vb_sa-java_sa' \
    --ae_steps 'vb_sa,java_sa' \
    --lambda_ae '0:1,30000:0.1,100000:0' \
    --word_shuffle 3 \
    --word_dropout '0.1' \
    --word_blank '0.3' \
    --encoder_only False \
    --n_layers 0 \
    --n_layers_encoder 6 \
    --n_layers_decoder 6 \
    --emb_dim 1024 \
    --n_heads 8 \
    --lgs 'java_sa-vb_sa' \
    --max_vocab 64000 \
    --gelu_activation false \
    --roberta_mode false \
    --reload_model '/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth,/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth' \
    --reload_encoder_for_decoder true \
    --amp 2 \
    --fp16 true \
    --tokens_per_batch 3000 \
    --group_by_size true \
    --max_batch_size 128 \
    --epoch_size 100 \
    --max_epoch 10000000 \
    --split_data_accross_gpu global \
    --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
    --eval_bleu true \
    --eval_computation true \
    --has_sentences_ids true \
    --generate_hypothesis true \
    --save_periodic 1 \
    --validation_metrics 'valid_vb_-java_mt_comp_acc' \
    --lgs_mapping 'vb_sa:vb,java_sa:java'

Could you please help me figure out how to obtain these parallel datasets? Also, is there a step that I am missing or doing incorrectly?
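
For what it's worth, the missing file name in the AssertionError suggests the trainer looks for parallel valid/test sets named after the language pair. A sketch of the names it appears to expect, with the pattern inferred purely from the error message above (confirm against the data loader in codegen_sources before relying on it):

```python
# Sketch: derive the parallel valid/test file names the trainer seems to
# expect. The "{split}.{pair}.{lang}.{chunk}.pth" pattern is an assumption
# inferred from the AssertionError, not taken from the repo's code.
def parallel_eval_files(lang1, lang2, n_chunks=1):
    langs = sorted([lang1, lang2])
    pair = "-".join(langs)                    # e.g. "java_sa-vb_sa"
    names = []
    for split in ("valid", "test"):
        for lang in langs:
            for i in range(n_chunks):
                names.append(f"{split}.{pair}.{lang}.{i}.pth")
    return names

for name in parallel_eval_files("java_sa", "vb_sa"):
    print(name)
```

The file from the error, valid.java_sa-vb_sa.java_sa.0.pth, falls out of this pattern; each such file would need to be a binarized parallel split (or a symlink to one) for the two languages.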

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 23 (8 by maintainers)

Top GitHub Comments

brozi commented on Nov 2, 2021 (1 reaction):
  1. Use the vocab and codes from step 1 (preprocessing in monolingual mode) for steps 2 and 3.
  2. Yes.
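
If it helps, reusing the step-1 artifacts can be as simple as copying them into the other dataset folders before rerunning preprocessing. A minimal sketch, assuming the pipeline picks up existing codes and vocab files already present in the dataset folder (the file names here are assumptions, not confirmed against the repo):

```python
# Sketch: copy the BPE "codes" and "vocab" files produced by the monolingual
# preprocessing run (step 1) into another dataset folder so later runs reuse
# them instead of learning a new, incompatible BPE. File names are assumed.
import shutil
from pathlib import Path

def reuse_bpe_artifacts(src_dir, dst_dir, names=("codes", "vocab")):
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        f = src / name
        if f.exists():                      # copy only what step 1 produced
            shutil.copy2(f, dst / name)
            copied.append(name)
    return copied
```

The point of reusing the same codes/vocab everywhere is that all .pth files are binarized against one shared vocabulary, which the MLM and TransCoder stages both expect.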
dineshkh commented on Apr 22, 2022 (0 reactions):

Hi @prnk04, @brozi,

Just to be sure: is it correct that the test.java_cl-java_sa.java_cl.pth file is simply a symbolic link to the java.test.cl.bpe.pth file?
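
One way to answer this locally is to inspect the file directly. A small sketch using only the standard library; the path below is a placeholder for your own XLM-syml folder:

```python
# Sketch: report whether a preprocessed .pth file is a symlink (and where it
# points), a regular file, or missing. The path is a placeholder.
import os

def describe(path):
    if os.path.islink(path):
        return f"{path} -> {os.readlink(path)}"
    if os.path.exists(path):
        return f"{path} is a regular file"
    return f"{path} does not exist"

print(describe("XLM-syml/test.java_cl-java_sa.java_cl.pth"))
```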

