
Negative CTC loss while training TFWav2Vec2ForCTC model


System Info

  • transformers version: 4.21.0.dev0
  • Platform: Linux-5.13.0-48-generic-x86_64-with-glibc2.31
  • Python version: 3.7.13
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): 2.9.1 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Who can help?

@Rocketknight1 @gante

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

Colab link to reproduce: https://colab.research.google.com/drive/1HXOdDhaIWcLF_4xF-zKZ_gYRf-sMfHkL?usp=sharing

Epoch 1/5
 28/3859 [..............................] - ETA: 47:03 - loss: -0.5141

Expected behavior

The model should train with a positive CTC loss. I have traced the source of the error: the target sequence never reaches the model during the forward pass, so the CTC loss is calculated over empty targets (None).
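As a sanity check: CTC loss is a negative log-likelihood, so for any valid probability it can never be negative. A negative training loss is therefore a strong signal that the targets are malformed or missing, as a minimal sketch shows:

```python
import math

# CTC loss is -log P(target | input). For any probability 0 < p <= 1,
# the loss is non-negative, so a negative loss (like the -0.5141 above)
# implies the targets never reached the loss computation correctly.
for p in (1e-6, 0.5, 1.0):
    assert -math.log(p) >= 0.0
```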

I have also figured out the solution:

add @unpack_inputs here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/wav2vec2/modeling_tf_wav2vec2.py#L1583
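For context, the decorator's job is to spread a packed input dict back into keyword arguments so that `labels` actually reaches the forward pass. A simplified sketch of the idea (the real @unpack_inputs in transformers.modeling_tf_utils handles many more cases; DummyModel here is purely illustrative):

```python
import functools

def unpack_inputs(func):
    """Simplified sketch: if Keras packed all inputs into one dict,
    spread it back into keyword arguments before calling `func`."""
    @functools.wraps(func)
    def wrapper(self, input_values, **kwargs):
        if isinstance(input_values, dict):
            kwargs = {**input_values, **kwargs}
            input_values = kwargs.pop("input_values", None)
        return func(self, input_values, **kwargs)
    return wrapper

class DummyModel:
    @unpack_inputs
    def call(self, input_values, labels=None):
        # Without the decorator, a dict input would leave labels=None
        # and the CTC loss would be computed over empty targets.
        return input_values, labels
```

With the decorator applied, `DummyModel().call({"input_values": [1, 2], "labels": [3]})` returns `([1, 2], [3])` instead of leaving `labels` as None.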

With this change, the CTC loss now receives the targets and computes a value, but training then raises another error:

InvalidArgumentError                      Traceback (most recent call last)
/tmp/ipykernel_33658/3396866883.py in <module>
      3 tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
      4 
----> 5 model.fit(train, validation_data = validation, epochs=5)

~/anaconda3/envs/gsoc-2/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

~/anaconda3/envs/gsoc-2/lib/python3.7/site-packages/transformers/modeling_tf_utils.py in train_step(self, data)
   1024 
   1025             if self._using_dummy_loss:
-> 1026                 loss = self.compiled_loss(y_pred.loss, y_pred.loss, sample_weight, regularization_losses=self.losses)
   1027             else:
   1028                 loss = None

InvalidArgumentError: slice index 0 of dimension 0 out of bounds. [Op:StridedSlice] name: strided_slice/

To solve this, I added loss = tf.reshape(loss, (1,)) after the CTC loss calculation here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/wav2vec2/modeling_tf_wav2vec2.py#L1707
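The reshape matters because Keras's dummy-loss path slices the loss along dimension 0, and a reduced CTC loss is a rank-0 scalar with no dimension to slice. A minimal NumPy analogy of the failure and the fix (tf.reshape behaves the same way on tensors):

```python
import numpy as np

loss = np.array(-0.5141, dtype=np.float32)  # rank-0 scalar, like a reduced loss
assert loss.ndim == 0                       # no dimension 0 to slice

try:
    loss[0]                                 # analogous to tf's StridedSlice on dim 0
    sliced_ok = True
except IndexError:                          # "slice index 0 of dimension 0 out of bounds"
    sliced_ok = False
assert not sliced_ok

loss = np.reshape(loss, (1,))               # the fix: give the loss an explicit dim
assert loss.shape == (1,)
```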

These two changes fix the errors, and I can now train my model. I am hoping the changes get pushed to the main branch.

The issue was previously mentioned here: https://github.com/huggingface/transformers/issues/15114. But since @Rocketknight1 mentioned that he is working on loss calculation across the HF TF models, I thought I would open a new issue.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
Sreyan88 commented, Jul 12, 2022

Hi @Rocketknight1 ,

Yes, I am trying to figure out #18096, though it’s a bit difficult for me as I am fairly new to Keras/TensorFlow. @gante’s suggestion did not work, so I am still investigating!

Thank you for the reply!

0 reactions
Rocketknight1 commented, Jul 12, 2022

Hi @Sreyan88 - I can’t figure out where that error is coming from. In your example scripts above, you’re running everything eagerly, which means that AutoGraph should not be doing anything. I think this is probably related to the issues in #18096, but let me know if you resolve those and this issue is still occurring!
