Score2Perf: A Question Regarding Mean-Aggregation and Decoder Inputs
Hi all,
Thank you to the Magenta team for building this - it's truly amazing to see what a community like this can accomplish. Makes 2020 that much better. Full disclosure - I'm coming at this from PyTorch and most of my experience is with Hugging Face's models, so forgive me if I mis-parse or don't yet understand TensorFlow conventions. Go easy on me!
Context: In the accompanying paper to the score2perf model, Encoding Musical Style with Transformer Autoencoders, you discuss "temporal compression" (page 2 diagram, below) - I am uncertain of how the compressed representation is attended to in the decoder:
The point in the code where this appears to be implemented is here (around line 110 of transformer_autoencoder.py):
```python
if not baseline:
  encoder_output = tf.math.reduce_mean(
      encoder_output, axis=1, keep_dims=True)
  encoder_decoder_attention_bias = tf.math.reduce_mean(
      encoder_decoder_attention_bias, axis=-1, keep_dims=True)
```
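For my own sanity, here is a quick standalone check of what that pooling does to the shapes (TF 2.x syntax, so `keepdims` rather than `keep_dims`; the dimensions are made up, not taken from the repo config):

```python
import tensorflow as tf

# Toy stand-ins for the real tensors (shapes only, not repo code).
batch, seq_len, d_model = 1, 512, 384  # made-up dimensions
encoder_output = tf.random.normal([batch, seq_len, d_model])

# Mean-pool over the time axis, keeping a length-1 time dimension.
pooled = tf.reduce_mean(encoder_output, axis=1, keepdims=True)
print(pooled.shape)  # (1, 1, 384) -- one summary vector per example
```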
My Question(s)

Firstly - doesn't reduce_mean include in its calculation the outputs corresponding to `<pad>` tokens? E.g. a sequence of length 123 fed to a model with block size/max input length 512 would have 512 - 123 = 389 pad tokens at its end, which (I believe) means the final 389 vectors in the encoder output (of shape [1, 512, d_model]) are meaningless, since the attention mask at the input is zero for all of those positions. Shouldn't we aggregate only the non-pad outputs?
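To make concrete what I mean by aggregating only the non-pad outputs, something like the following masked mean (a hypothetical helper I wrote for illustration - it is not in transformer_autoencoder.py, and `pad_mask` is an assumed input):

```python
import tensorflow as tf

def masked_mean(encoder_output, pad_mask):
  """Mean-pool over time while ignoring <pad> positions.

  Args:
    encoder_output: [batch, seq_len, d_model] encoder activations.
    pad_mask: [batch, seq_len] float mask, 1.0 for real tokens, 0.0 for padding.

  Returns:
    [batch, 1, d_model] average over the non-pad positions only.
  """
  mask = tf.expand_dims(pad_mask, axis=-1)             # [batch, seq_len, 1]
  summed = tf.reduce_sum(encoder_output * mask, axis=1, keepdims=True)
  counts = tf.reduce_sum(mask, axis=1, keepdims=True)  # number of real tokens
  return summed / tf.maximum(counts, 1.0)
```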
Second - how is the resulting output vector attended to in the decode step? I'm used to encoder outputs of size [batch, block_size, d_model], not [batch, 1, d_model] (which I believe is the case here). Are all the cross-attentions attending to a single vector? Is that okay?
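If I'm reading the shapes right, the softmax over a single key is trivially 1.0, so every decoder position receives the same context vector from cross-attention. A toy check of that reasoning (projection matrices omitted for brevity, dimensions made up):

```python
import tensorflow as tf

# Toy shapes only; the W_Q/W_K/W_V projections of real attention are omitted.
batch, tgt_len, d_model = 1, 8, 64
queries = tf.random.normal([batch, tgt_len, d_model])  # decoder states
memory = tf.random.normal([batch, 1, d_model])         # the single pooled encoding

logits = tf.matmul(queries, memory, transpose_b=True) / tf.sqrt(float(d_model))
weights = tf.nn.softmax(logits, axis=-1)  # [1, 8, 1]; softmax over one key is 1.0
context = tf.matmul(weights, memory)      # every decoder position gets the same vector
print(weights[0, :, 0].numpy())           # all ones
```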
Third - what do the decoder teacher-forcing inputs look like at train time? Say we're encoding input tokens [[4, 2, 7]]. Are our decoder inputs [[0, 4, 2]] (standard right-shifting as in T5-style causal language modeling), or something else? I'm not sure how the aggregated encoding changes things.
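For clarity, this is the right-shift I have in mind (the token values are just my example above, and 0 is assumed to be the start/pad id):

```python
import tensorflow as tf

# Prepend a 0 and drop the last token, so the decoder at position t
# is trained to predict target token t.
targets = tf.constant([[4, 2, 7]])
decoder_inputs = tf.pad(targets, [[0, 0], [1, 0]])[:, :-1]
print(decoder_inputs.numpy())  # [[0 4 2]]
```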
Any help would be greatly appreciated!
Top GitHub Comments
@eeelnico You'll want to read this.
Thanks @kristychoi! We can pretty readily expand our dataset; I'll try that plus the masking (if present). For the perturbations, we'll definitely try masks and substitutions. If we try anything else that works, I'll be sure to post it here.
Thank you for your help! It's good to know we're more or less on the right track.