Question regarding Backtranslation
Hi,
I have a basic question about why back-translation works in this scenario. Typically in NLP, we collect some parallel data to train Transformer-like models and then use back-translation (BT) on a large collection of monolingual data.
In contrast, TransCoder first goes through a pre-training stage and is then trained via BT. Since TransCoder has no notion of cross-language generation at the beginning of BT, it would presumably generate a sequence in the same language (Java input to Java output, instead of Python output). Feeding that generated sequence back to reconstruct the original sequence is not going to teach the model to translate. How, then, does back-translation provide the learning bias needed to perform translation?
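To make my mental model of BT concrete, here is a rough sketch of one online BT step as I understand it (`model`, `generate`, and the language arguments are hypothetical placeholders, not the actual TransCoder or PLBART API):

```python
import torch

# Rough sketch of one online back-translation (BT) step.
def bt_step(model, java_batch, optimizer):
    # 1) Use the current model to "translate" Java -> Python, without gradients.
    #    Early in training this may simply copy the Java input, which is my concern.
    with torch.no_grad():
        synthetic_python = model.generate(java_batch, target_lang="python")

    # 2) Train Python -> Java on the synthetic pair: the original Java batch
    #    serves as the reference target.
    loss = model(src=synthetic_python, src_lang="python",
                 tgt=java_batch, tgt_lang="java")
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss
```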
Recently, I tried to apply BT to our model, PLBART, to teach it to translate. However, at the very beginning of BT training, when I checked what PLBART generates for a given Java input, I saw that it produces exactly the input sequence, even though generation is conditioned on a prefix token for the target language (Python). For example:
```
# input
static public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; } [java]

# output
[python] public int staticGetLargeMemoryClass ( ) { String vmHeapSize = SystemProperties . get ( " dalvik . vm . heapsize " , "16m " ) ; return Integer . parseInt ( vmHeapSize . substring ( 0 , vmHeapSize . length ( ) - 1 ) ) ; }
```
As you can see above, exactly the same sequence is generated. PLBART is pre-trained via Denoising Autoencoding (DAE), so it has no clue about cross-language generation. I am curious: how does TransCoder learn from BT?
If I am not wrong, TransCoder uses a language embedding with each input token (REF). Do you think that could make a difference? Also, could you shed some light on TransCoder's architecture? It does not seem to be a typical sequence-to-sequence model.
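For reference, my understanding of the XLM/TransCoder-style input embeddings is roughly the following (dimensions and names are illustrative, not the actual code):

```python
import torch
import torch.nn as nn

# Sketch of input embeddings where a language embedding is added to EVERY
# token position, not just a single prefix token.
class InputEmbeddings(nn.Module):
    def __init__(self, vocab_size, n_langs, max_len, dim):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)

    def forward(self, token_ids, lang_id):
        # token_ids: (batch, seq_len); lang_id: scalar language index
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        langs = torch.full_like(token_ids, lang_id)
        # The language signal reaches every position of the sequence.
        return self.token_emb(token_ids) + self.pos_emb(positions) + self.lang_emb(langs)
```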
Top GitHub Comments
I agree that adding a language embedding for every token can provide a stronger signal; that's the reason why we chose to do it this way. I would have thought that adding a language token at the beginning of the sentence would be enough to generate a function in the target language, since the model can learn to pay attention to this token. It should at least be able to learn to generate a "def" token instead of "public static" after a [python] token.

I think the main difference could be that you first train with the denoising objective until convergence and only then with the BT objective, while we trained both at the same time. You can then end up in a bad local minimum where the model just copies the input sentence and ignores the language token. If you train with the BT objective at the same time, your model will learn early that the [python] token is followed by something like "def" (or "import" if you train on whole files), and it should work better. Actually, we trained a baseline with denoising for a revision of DOBF, and we only reloaded the encoder to make the unsupervised translation work. Otherwise we also get stuck in a state where the model copies the input sentence.
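A minimal sketch of what training the two objectives on the same schedule could look like (the `noise` function, `generate` call, loss interface, and data streams are placeholders, not the actual TransCoder training loop):

```python
import torch

# Sketch: interleaving denoising auto-encoding (DAE) and back-translation (BT)
# in the same loop, instead of running DAE to convergence first.
for java_batch, python_batch in zip(java_stream, python_stream):
    # DAE: reconstruct each function from a corrupted copy of itself.
    dae_loss = model(src=noise(java_batch), src_lang="java",
                     tgt=java_batch, tgt_lang="java") \
             + model(src=noise(python_batch), src_lang="python",
                     tgt=python_batch, tgt_lang="python")

    # BT on the same schedule: the model learns early that a [python] target
    # starts with something like "def", so copying the input stops paying off.
    with torch.no_grad():
        synthetic_python = model.generate(java_batch, target_lang="python")
    bt_loss = model(src=synthetic_python, src_lang="python",
                    tgt=java_batch, tgt_lang="java")

    (dae_loss + bt_loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```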
About the part of our code you linked to: we use the same TransformerModel class for our encoders and decoders and just add a class attribute `is_decoder` to know what kind of transformer it is and whether we should do cross-attention.

Thanks a lot for your comments. Perhaps simultaneous training via DAE and BT is the key factor.
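For anyone reading along, a minimal sketch of the shared encoder/decoder pattern described above could look like this (a single layer with illustrative names, not the actual TransCoder TransformerModel):

```python
import torch.nn as nn

# One Transformer class reused for both encoder and decoder; the is_decoder
# flag controls whether a cross-attention sub-layer exists and is applied.
class TransformerModel(nn.Module):
    def __init__(self, dim, n_heads, is_decoder=False):
        super().__init__()
        self.is_decoder = is_decoder
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Cross-attention over the encoder output only exists in decoder mode.
        self.cross_attn = (nn.MultiheadAttention(dim, n_heads, batch_first=True)
                           if is_decoder else None)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, encoder_out=None):
        x = x + self.self_attn(x, x, x)[0]
        if self.is_decoder:
            # Attend to the encoder's output (cross-attention).
            x = x + self.cross_attn(x, encoder_out, encoder_out)[0]
        return x + self.ffn(x)
```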