Score2Perf: A Question Regarding Mean-Aggregation and Decoder Inputs
Hi all,
Thank you to the Magenta team for building this - it's truly amazing to see what a community like this can accomplish. Makes 2020 that much better. Full disclosure - I'm coming at this from PyTorch and most of my experience is with Hugging Face's models, so forgive me if I mis-parse or don't yet understand TensorFlow conventions. Go easy on me!
Context: In the accompanying paper to the score2perf model, Encoding Musical Style with Transformer Autoencoders, you discuss "temporal compression" (page 2 diagram, below) - I am uncertain of how the compressed representation is attended to in the decoder:
The point in the code where this appears to be implemented is here (around line 110 of transformer_autoencoder.py):
```python
if not baseline:
  encoder_output = tf.math.reduce_mean(
      encoder_output, axis=1, keep_dims=True)
  encoder_decoder_attention_bias = tf.math.reduce_mean(
      encoder_decoder_attention_bias, axis=-1, keep_dims=True)
```
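For my own sanity, here is a quick standalone check of what that pooling does to the shapes (TF 2.x syntax, so `keepdims` rather than `keep_dims`; the dimensions are made up, not taken from the repo config):

```python
import tensorflow as tf

# Toy stand-ins for the real tensors (shapes only, not repo code).
batch, seq_len, d_model = 1, 512, 384  # made-up dimensions
encoder_output = tf.random.normal([batch, seq_len, d_model])

# Mean-pool over the time axis, keeping a length-1 time dimension.
pooled = tf.reduce_mean(encoder_output, axis=1, keepdims=True)
print(pooled.shape)  # (1, 1, 384) -- one summary vector per example
```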
My Question(s)

Firstly - doesn't reduce_mean include in its calculation the outputs corresponding to `<pad>` tokens? E.g. a sequence of length 123 fed to a model with block size/max input length 512 would have 512 - 123 = 389 pad tokens at its end, which (I believe) means the final 389 vectors in the encoder output (of shape [1, 512, d_model]) are meaningless, since the attention mask at the input is zero for all of those positions. Shouldn't we aggregate only the non-pad outputs?
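To make concrete what I mean by aggregating only the non-pad outputs, something like the following masked mean (a hypothetical helper I wrote for illustration - it is not in transformer_autoencoder.py, and `pad_mask` is an assumed input):

```python
import tensorflow as tf

def masked_mean(encoder_output, pad_mask):
  """Mean-pool over time while ignoring <pad> positions.

  Args:
    encoder_output: [batch, seq_len, d_model] encoder activations.
    pad_mask: [batch, seq_len] float mask, 1.0 for real tokens, 0.0 for padding.

  Returns:
    [batch, 1, d_model] average over the non-pad positions only.
  """
  mask = tf.expand_dims(pad_mask, axis=-1)             # [batch, seq_len, 1]
  summed = tf.reduce_sum(encoder_output * mask, axis=1, keepdims=True)
  counts = tf.reduce_sum(mask, axis=1, keepdims=True)  # number of real tokens
  return summed / tf.maximum(counts, 1.0)
```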
Second - how is the resulting output vector attended to in the decode step? I'm used to encoder outputs of size [batch, block_size, d_model], not [batch, 1, d_model] (which I believe is the case here). Are all the cross-attentions attending to a single vector? Is that okay?
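If I'm reading the shapes right, the softmax over a single key is trivially 1.0, so every decoder position receives the same context vector from cross-attention. A toy check of that reasoning (projection matrices omitted for brevity, dimensions made up):

```python
import tensorflow as tf

# Toy shapes only; the W_Q/W_K/W_V projections of real attention are omitted.
batch, tgt_len, d_model = 1, 8, 64
queries = tf.random.normal([batch, tgt_len, d_model])  # decoder states
memory = tf.random.normal([batch, 1, d_model])         # the single pooled encoding

logits = tf.matmul(queries, memory, transpose_b=True) / tf.sqrt(float(d_model))
weights = tf.nn.softmax(logits, axis=-1)  # [1, 8, 1]; softmax over one key is 1.0
context = tf.matmul(weights, memory)      # every decoder position gets the same vector
print(weights[0, :, 0].numpy())           # all ones
```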
Third - what do the decoder teacher-forcing inputs look like at train time? Say we're encoding input tokens [[4, 2, 7]]. Are our decoder inputs [[0, 4, 2]] (standard right-shifting as in T5-style causal language modeling), or something else? I'm not sure how the aggregated encoding changes things.
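For clarity, this is the right-shift I have in mind (the token values are just my example above, and 0 is assumed to be the start/pad id):

```python
import tensorflow as tf

# Prepend a 0 and drop the last token, so the decoder at position t
# is trained to predict target token t.
targets = tf.constant([[4, 2, 7]])
decoder_inputs = tf.pad(targets, [[0, 0], [1, 0]])[:, :-1]
print(decoder_inputs.numpy())  # [[0 4 2]]
```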
Any help would be greatly appreciated!
Top GitHub Comments
@eeelnico You'll want to read this.
Thanks @kristychoi! We can pretty readily expand our dataset; I'll try that plus the masking (if present). For the perturbations, we'll definitely try masks and substitutions. If we try anything else that works, I'll be sure to post it here.
Thank you for your help! It's good to know we're more or less on the right track.