Question regarding Backtranslation
Hi,
I have a basic question about why back-translation works in this scenario. Typically in NLP, we collect some parallel data to train Transformer-like models and then use back-translation (BT) on a large collection of monolingual data.
In contrast, TransCoder first goes through a pre-training stage and is then trained via BT. Since TransCoder has no notion of cross-language generation at the beginning of BT, it would presumably generate a sequence in the same language (Java input to Java output, instead of Python output). Feeding that generated sequence back to reconstruct the original sequence is not going to teach the model to translate. How, then, does back-translation provide the learning bias needed to perform translation?
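To make my mental model of BT concrete, here is a rough sketch of one online BT step as I understand it (`model`, `generate`, and the language arguments are hypothetical placeholders, not the actual TransCoder or PLBART API):

```python
import torch

# Rough sketch of one online back-translation (BT) step.
def bt_step(model, java_batch, optimizer):
    # 1) Use the current model to "translate" Java -> Python, without gradients.
    #    Early in training this may simply copy the Java input, which is my concern.
    with torch.no_grad():
        synthetic_python = model.generate(java_batch, target_lang="python")

    # 2) Train Python -> Java on the synthetic pair: the original Java batch
    #    serves as the reference target.
    loss = model(src=synthetic_python, src_lang="python",
                 tgt=java_batch, tgt_lang="java")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```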
Recently, I tried to apply BT to our model, PLBART, to teach it to translate. However, at the very beginning of BT training, when I checked what PLBART generates for a given Java input, I saw that it produces exactly the input sequence, even though generation is conditioned on a prefix token for the target language (Python). For example:
```
# input
static public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; } [java]

# output
[python] public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; }
```
As you can see above, exactly the same sequence is generated. PLBART is pre-trained via Denoising Autoencoding (DAE), so it has no clue about cross-language generation. I am curious: how does TransCoder learn from BT?
If I am not wrong, TransCoder uses a language embedding with each input token (REF). Do you think that could make a difference? Also, could you shed some light on TransCoder's architecture? It does not seem to be a typical sequence-to-sequence model.
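For reference, my understanding of the XLM/TransCoder-style input embeddings is roughly the following (dimensions and names are illustrative, not the actual code):

```python
import torch
import torch.nn as nn

# Sketch of input embeddings where a language embedding is added to EVERY
# token position, not just a single prefix token.
class InputEmbeddings(nn.Module):
    def __init__(self, vocab_size, n_langs, max_len, dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: scalar language index
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        langs = torch.full_like(token_ids, lang_id)
        # The language signal reaches every position of the sequence.
        return self.token_emb(token_ids) + self.pos_emb(positions) + self.lang_emb(langs)
```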
Top GitHub Comments
I agree that adding a language embedding for every token can provide a stronger signal; that's the reason why we chose to do it this way. I would have thought that adding a language token at the beginning of the sentence would be enough to generate a function in the target language, since the model can learn to pay attention to this token. It should at least be able to learn to generate a "def" token instead of "public static" after a [python] token.

I think the main difference could be that you first train with the denoising objective until convergence and only then with the BT objective, while we trained both at the same time. You can then end up in a bad local minimum where the model just copies the input sentence and ignores the language token. If you train with the BT objective at the same time, your model will learn early that the [python] token is followed by something like "def" (or "import" if you train on whole files), and it should work better. Actually, we trained a baseline with denoising for a revision of DOBF, and we only reloaded the encoder to make the unsupervised translation work. Otherwise we also get stuck in a state where the model copies the input sentence.
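A minimal sketch of what training the two objectives on the same schedule could look like (the `noise` function, `generate` call, loss interface, and data streams are placeholders, not the actual TransCoder training loop):

```python
import torch

# Sketch: interleaving denoising auto-encoding (DAE) and back-translation (BT)
# in the same loop, instead of running DAE to convergence first.
for java_batch, python_batch in zip(java_stream, python_stream):
    # DAE: reconstruct each function from a corrupted copy of itself.
    dae_loss = model(src=noise(java_batch), src_lang="java",
                     tgt=java_batch, tgt_lang="java") \
             + model(src=noise(python_batch), src_lang="python",
                     tgt=python_batch, tgt_lang="python")

    # BT on the same schedule: the model learns early that a [python] target
    # starts with something like "def", so copying the input stops paying off.
    with torch.no_grad():
        synthetic_python = model.generate(java_batch, target_lang="python")
    bt_loss = model(src=synthetic_python, src_lang="python",
                    tgt=java_batch, tgt_lang="java")

    (dae_loss + bt_loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```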
About the part of our code you linked to: we use the same TransformerModel class for our encoders and decoders and just add a class attribute `is_decoder` to know what kind of transformer it is and whether we should do cross-attention.

Thanks a lot for your comments. Perhaps simultaneous training via DAE and BT is the key factor.
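For anyone reading along, a minimal sketch of the shared encoder/decoder pattern described above could look like this (a single layer with illustrative names, not the actual TransCoder TransformerModel):

```python
import torch.nn as nn

# One Transformer class reused for both encoder and decoder; the is_decoder
# flag controls whether a cross-attention sub-layer exists and is applied.
class TransformerModel(nn.Module):
    def __init__(self, dim, n_heads, is_decoder=False):
        super().__init__()
        self.is_decoder = is_decoder
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Cross-attention over the encoder output only exists in decoder mode.
        self.cross_attn = (nn.MultiheadAttention(dim, n_heads, batch_first=True)
                           if is_decoder else None)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, encoder_out=None):
        x = x + self.self_attn(x, x, x)[0]
        if self.is_decoder:
            # Attend to the encoder's output (cross-attention).
            x = x + self.cross_attn(x, encoder_out, encoder_out)[0]
        return x + self.ffn(x)
```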