
Parallel datasets


Hi, I am trying to create a POC using CodeGen to translate code between VB and Java. I downloaded the training data for VB and Java using Google BigQuery, and I have completed the preprocessing step using these commands:

  1. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual_functions --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10
  2. python -m codegen_sources.preprocessing.preprocess /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1 --langs vb java --mode=monolingual --local=True --bpe_mode=fast --train_splits=10 --percent_test_valid=10

As a result, the following files were created inside the folder XLM-syml:

  1. test.[java_cl|java_monolingual|java_sa|vb_cl|vb_monolingual|vb_sa].pth
  2. train.[java_cl|java_monolingual|java_sa|vb_cl|vb_monolingual|vb_sa].[0-9].pth
  3. valid.[java_cl|java_monolingual|java_sa|vb_cl|vb_monolingual|vb_sa].pth
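
As a quick sanity check, the expected outputs can be enumerated and compared against the folder contents. A minimal sketch, assuming the suffix and split-index naming shown in the list above (the exact convention should be confirmed against the preprocessing code):

```python
# Sketch (not from the repo): enumerate the .pth files a monolingual_functions
# preprocessing run with --train_splits=10 is expected to produce, so missing
# outputs can be spotted before training. Naming is an assumption based on the
# file list above.
from itertools import product

def expected_pth_files(langs, n_train_splits=10):
    suffixes = ("cl", "monolingual", "sa")
    files = []
    for lang, suf in product(langs, suffixes):
        files.append(f"test.{lang}_{suf}.pth")
        files.append(f"valid.{lang}_{suf}.pth")
        files.extend(f"train.{lang}_{suf}.{i}.pth" for i in range(n_train_splits))
    return sorted(files)

names = expected_pth_files(["vb", "java"])
print(len(names))  # 2 languages x 3 suffixes x (test + valid + 10 train splits)
```

Diffing this list against `os.listdir("XLM-syml")` shows at a glance which splits failed to materialize.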

After that, I trained the MLM model using the following command:

  python codegen_sources/model/train.py \
    --exp_name mlm_vb_java_fast_mono_updated_v0 \
    --dump_path '/content/Facebook_CodeGen/dumpPath_fast_mono_updated' \
    --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' \
    --mlm_steps 'vb_sa,java_sa' \
    --add_eof_to_stream true \
    --word_mask_keep_rand '0.8,0.1,0.1' \
    --word_pred '0.15' \
    --encoder_only true \
    --n_layers 6 \
    --emb_dim 1024 \
    --n_heads 8 \
    --lgs 'vb_sa-java_sa' \
    --max_vocab 64000 \
    --gelu_activation false \
    --roberta_mode false \
    --amp 2 \
    --fp16 true \
    --batch_size 16 \
    --bptt 512 \
    --epoch_size 200 \
    --max_epoch 100000 \
    --split_data_accross_gpu global \
    --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
    --save_periodic 0 \
    --validation_metrics _valid_mlm_ppl \
    --stopping_criterion '_valid_mlm_ppl,10'

However, when I try to train the TransCoder model with the following command, I get AssertionError: /content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml/valid.java_sa-vb_sa.java_sa.0.pth. Command:

  python codegen_sources/model/train.py \
    --exp_name transcoder_vb_java_updated_v1 \
    --dump_path '/content/drive/MyDrive/dumpPath_updated_transcoder_v0' \
    --data_path '/content/Facebook_CodeGen/fastTrainingData_monoFunc_updated_v1/XLM-syml' \
    --split_data_accross_gpu local \
    --bt_steps 'vb_sa-java_sa-vb_sa,java_sa-vb_sa-java_sa' \
    --ae_steps 'vb_sa,java_sa' \
    --lambda_ae '0:1,30000:0.1,100000:0' \
    --word_shuffle 3 \
    --word_dropout '0.1' \
    --word_blank '0.3' \
    --encoder_only False \
    --n_layers 0 \
    --n_layers_encoder 6 \
    --n_layers_decoder 6 \
    --emb_dim 1024 \
    --n_heads 8 \
    --lgs 'java_sa-vb_sa' \
    --max_vocab 64000 \
    --gelu_activation false \
    --roberta_mode false \
    --reload_model '/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth,/content/Facebook_CodeGen/dumpPath_fast_mono_updated/mlm_vb_java_fast_mono_updated_v1/fkmc1busqw/checkpoint.pth' \
    --reload_encoder_for_decoder true \
    --amp 2 \
    --fp16 true \
    --tokens_per_batch 3000 \
    --group_by_size true \
    --max_batch_size 128 \
    --epoch_size 100 \
    --max_epoch 10000000 \
    --split_data_accross_gpu global \
    --optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
    --eval_bleu true \
    --eval_computation true \
    --has_sentences_ids true \
    --generate_hypothesis true \
    --save_periodic 1 \
    --validation_metrics 'valid_vb_-java_mt_comp_acc' \
    --lgs_mapping 'vb_sa:vb,java_sa:java'

Could you please help me figure out how to obtain these parallel datasets? Also, is there a step that I am missing or doing incorrectly?
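
For what it's worth, the missing file name in the AssertionError suggests the trainer looks for parallel valid/test sets named after the language pair. A sketch of the names it appears to expect, with the pattern inferred purely from the error message above (confirm against the data loader in codegen_sources before relying on it):

```python
# Sketch: derive the parallel valid/test file names the trainer seems to
# expect. The "{split}.{pair}.{lang}.{chunk}.pth" pattern is an assumption
# inferred from the AssertionError, not taken from the repo's code.
def parallel_eval_files(lang1, lang2, n_chunks=1):
    langs = sorted([lang1, lang2])
    pair = "-".join(langs)                    # e.g. "java_sa-vb_sa"
    names = []
    for split in ("valid", "test"):
        for lang in langs:
            for i in range(n_chunks):
                names.append(f"{split}.{pair}.{lang}.{i}.pth")
    return names

for name in parallel_eval_files("java_sa", "vb_sa"):
    print(name)
```

The file from the error, valid.java_sa-vb_sa.java_sa.0.pth, falls out of this pattern; each such file would need to be a binarized parallel split (or a symlink to one) for the two languages.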

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 23 (8 by maintainers)

Top GitHub Comments

brozi commented on Nov 2, 2021 (1 reaction):
  1. Use the vocab and codes from step 1 (preprocessing in monolingual mode) for steps 2 and 3.
  2. Yes.
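
If it helps, reusing the step-1 artifacts can be as simple as copying them into the other dataset folders before rerunning preprocessing. A minimal sketch, assuming the pipeline picks up existing codes and vocab files already present in the dataset folder (the file names here are assumptions, not confirmed against the repo):

```python
# Sketch: copy the BPE "codes" and "vocab" files produced by the monolingual
# preprocessing run (step 1) into another dataset folder so later runs reuse
# them instead of learning a new, incompatible BPE. File names are assumed.
import shutil
from pathlib import Path

def reuse_bpe_artifacts(src_dir, dst_dir, names=("codes", "vocab")):
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        f = src / name
        if f.exists():                      # copy only what step 1 produced
            shutil.copy2(f, dst / name)
            copied.append(name)
    return copied
```

The point of reusing the same codes/vocab everywhere is that all .pth files are binarized against one shared vocabulary, which the MLM and TransCoder stages both expect.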
dineshkh commented on Apr 22, 2022 (0 reactions):

Hi @prnk04, @brozi,

Just to be sure: is it correct that the test.java_cl-java_sa.java_cl.pth file is simply a symbolic link to the java.test.cl.bpe.pth file?
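
One way to answer this locally is to inspect the file directly. A small sketch using only the standard library; the path below is a placeholder for your own XLM-syml folder:

```python
# Sketch: report whether a preprocessed .pth file is a symlink (and where it
# points), a regular file, or missing. The path is a placeholder.
import os

def describe(path):
    if os.path.islink(path):
        return f"{path} -> {os.readlink(path)}"
    if os.path.exists(path):
        return f"{path} is a regular file"
    return f"{path} does not exist"

print(describe("XLM-syml/test.java_cl-java_sa.java_cl.pth"))
```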

