Does it support masking?
Hello CyberZHG,
I have a sequence of inputs and a sequence of outputs, where each input has an associated output (label), as in part-of-speech (POS) tagging.
Seq_in[0][0:3]
array([[15], [28], [23]])
Seq_out[0][0:3]
array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
I am using the following code for training:
X_train, X_val, Y_train, Y_val = train_test_split(Seq_in, Seq_out, test_size=0.20)

model = Sequential()
model.add(Masking(mask_value=5, input_shape=(Seq_in.shape[1], 1)))  # time steps is 500
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(256, return_sequences=True)))
model.add(Dropout(0.2))
model.add(seq_self_attention.SeqSelfAttention())
model.add(Dense(15, activation='softmax'))

sgd = optimizers.SGD(lr=.1, momentum=0.9, decay=1e-3, nesterov=True)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

model.fit(X_train, Y_train, epochs=2, validation_data=(X_val, Y_val), verbose=2)
I have a couple of concerns. It seems that the implementation supports masking, but is what I am doing in the code the correct way to use masking, or is there another way?
Why do we need the variable units in the constructor? Doesn't the code figure it out itself?
Following the equations posted in the README file, the process is first to sum each neighboring state h_t' with the state of the current time step h_t, then take the tanh of each unit in each state, which produces the same shape (first equation).
Second, each state h_t' is squashed to a single value (a scalar) using the sigmoid function (second equation).
Third, we take the softmax between the state of the current time step and the other states h_t'.
Finally, we multiply the softmax probability (attention weight) with each unit and then take the weighted sum.
Is my understanding correct? If so, why do we need units in the constructor?
Also, we have two methods, multiplicative and additive. Where can I see the difference with regard to the equations?
Sorry for so many questions. I would appreciate your answers. Thank you.
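On the masking question, it may help to see which timesteps a mask value actually hides. The sketch below reproduces, in plain NumPy, the rule documented for the Keras Masking layer: a timestep is skipped only when all of its features equal mask_value. The helper name timestep_mask and the toy batch are hypothetical, and the assumption that 5 is the padding id simply follows mask_value=5 in the code above.

```python
import numpy as np

def timestep_mask(x, mask_value):
    """Boolean mask of shape (batch, time): True = keep, False = skip.

    A timestep is masked only if ALL of its features equal mask_value.
    """
    return np.any(x != mask_value, axis=-1)

# Hypothetical toy batch: 1 sequence, 5 timesteps, 1 feature (a token id).
# The value 5 is assumed to be the padding id, as in mask_value=5.
x = np.array([[[15], [28], [23], [5], [5]]], dtype=np.float32)

mask = timestep_mask(x, mask_value=5)
print(mask.tolist())  # [[True, True, True, False, False]]
```

Under this rule, a real token whose id happens to be 5 would be skipped as well, so the padding value should be one that never occurs as a real input.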
Issue Analytics
- Created 5 years ago
- Comments:6 (4 by maintainers)
Top GitHub Comments
See UOI-1806.01264, which is also a tagging task. The attention weights are approximately equal at initialization; however, they do spread out after several epochs.
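The near-uniform weights at initialization can be illustrated with a small NumPy sketch. It assumes that freshly initialized parameters produce raw scores close to zero (modelled here as small random numbers), and it follows the sigmoid-then-softmax order described in the question. It is an illustration, not the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5  # number of timesteps (illustrative)

# Assumption: small initial weights give raw scores close to zero.
raw = rng.normal(scale=0.01, size=(T, T))

e = 1.0 / (1.0 + np.exp(-raw))                       # sigmoid: all values near 0.5
a = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)  # softmax over each row

# Largest gap between any attention weight and the uniform value 1/T.
max_dev = np.abs(a - 1.0 / T).max()
print(max_dev < 0.01)
```

As training moves the scores away from zero, the rows of `a` can become less uniform.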
The default option is additive. The equation for multiplicative is in this section.
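The difference between the two options, and the role of units, can be sketched in NumPy. The additive path follows the steps described in the question (tanh of summed projections, sigmoid to a scalar, softmax, weighted sum). The multiplicative path uses the common bilinear form, which is an assumption here and has not been checked against the library. All function and parameter names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def additive_scores(x, W_t, W_x, b_h, w_a, b_a):
    # x: (T, d); W_t, W_x: (d, units); b_h, w_a: (units,)
    q = x @ W_t                                        # current step t, (T, units)
    k = x @ W_x                                        # other steps t', (T, units)
    h = np.tanh(q[:, None, :] + k[None, :, :] + b_h)   # (T, T, units)
    return 1.0 / (1.0 + np.exp(-(h @ w_a + b_a)))      # sigmoid -> (T, T)

def multiplicative_scores(x, W_a, b_a):
    # Assumed bilinear form; W_a: (d, d), so no hidden width is involved.
    return x @ W_a @ x.T + b_a                         # (T, T)

def attend(x, scores):
    a = softmax(scores)      # each row sums to 1
    return a @ x, a          # weighted sum of inputs, (T, d)

rng = np.random.default_rng(0)
T, d, units = 4, 6, 3
x = rng.normal(size=(T, d))

e_add = additive_scores(x, rng.normal(size=(d, units)), rng.normal(size=(d, units)),
                        np.zeros(units), rng.normal(size=units), 0.0)
e_mul = multiplicative_scores(x, rng.normal(size=(d, d)), 0.0)

out_add, a_add = attend(x, e_add)
out_mul, a_mul = attend(x, e_mul)
print(out_add.shape, out_mul.shape)  # (4, 6) (4, 6)
```

In this sketch, units only sets the width of the hidden projection inside the additive score. The output keeps the input shape (T, d) for any value of units, which is why it is a free hyperparameter rather than something that can be inferred from the input.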