
Trivial/degenerate solution to Wav2vec 2

See original GitHub issue

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I was trying to understand Wav2Vec 2.0, and it seems the implementation might lead to trivial solutions in some cases.

Specifically, if the model always assigns the positive and the negative samples the same code, it can achieve a deceptively good InfoNCE estimate.
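
For context, this is the contrastive objective from the wav2vec 2.0 paper, where c_t is the context network output at a masked step t, q_t is the true quantized target, Q_t is the candidate set of q_t plus K distractors, sim is cosine similarity, and κ is the temperature:

    \mathcal{L}_m = -\log \frac{\exp(\mathrm{sim}(c_t, q_t)/\kappa)}{\sum_{\tilde{q} \in Q_t} \exp(\mathrm{sim}(c_t, \tilde{q})/\kappa)}

If every distractor equals q_t, each term in the denominator matches the numerator and the loss would plateau at log(K + 1); as shown in the code below, the implementation instead masks duplicate negatives to -inf, which is what allows a fully collapsed codebook to drive the loss toward zero.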

I believe the reason lies in the implementation of compute_preds() in https://github.com/pytorch/fairseq/blob/master/fairseq/models/wav2vec/wav2vec2.py#L478: wherever neg_is_pos evaluates to True, the corresponding negative's logit is set to -inf. In that case, the cross-entropy loss in wav2vec_criterion.py is trivially minimized, even though the learnt representation is not necessarily meaningful.

Is this by design? Have you encountered this issue before, and do you have any advice on how to avoid it?

Your help is very much appreciated.

Code

    # Excerpt from fairseq/models/wav2vec/wav2vec2.py; `index_put` and
    # `is_xla_tensor` are imported from fairseq.utils.
    def compute_preds(self, x, y, negatives):

        # True wherever a sampled negative is identical to the positive target.
        neg_is_pos = (y == negatives).all(-1)
        y = y.unsqueeze(0)
        # Stack candidates along dim 0, with the positive at index 0.
        targets = torch.cat([y, negatives], dim=0)

        logits = torch.cosine_similarity(x.float(), targets.float(), dim=-1).type_as(x)

        logits = logits / self.logit_temp

        if is_xla_tensor(logits) or neg_is_pos.any():
            fillval = -float(2 ** 30)
            if not hasattr(self, "_inftensor"):
                self._inftensor = (
                    torch.tensor(fillval).to(x.device)
                    if is_xla_tensor(logits)
                    else float("-inf")
                )
            # Mask duplicate negatives so they cannot serve as distractors.
            logits[1:] = index_put(logits[1:], neg_is_pos, self._inftensor)

        return logits
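
To see the degenerate case numerically, here is a minimal sketch (not fairseq code; the shapes and the 0.1 temperature are illustrative defaults) of what the criterion computes when every negative duplicates the positive and is masked to -inf:

    import torch
    import torch.nn.functional as F

    # Hypothetical setup: 1 positive + 3 negatives over 4 masked time steps.
    # In the collapsed case neg_is_pos is True everywhere, so all negative
    # logits end up at -inf after compute_preds().
    logits = torch.full((4, 4), float("-inf"))  # (candidates, time)
    logits[0] = 1.0 / 0.1  # positive: cosine similarity 1.0, temperature 0.1

    # The criterion treats candidate index 0 (the positive) as the target.
    targets = torch.zeros(logits.size(1), dtype=torch.long)
    loss = F.cross_entropy(logits.transpose(0, 1), targets)
    print(loss)  # tensor(0.) -- a perfect loss despite a collapsed codebook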

What have you tried?

What’s your environment?

  • fairseq Version (master):
  • PyTorch Version (1.7)
  • OS (e.g., Linux): ubuntu 18
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install --editable ./
  • Python version: 3.7
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: 2080 Ti
  • Any other relevant information:

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 10
  • Comments: 5

Top GitHub Comments

3 reactions
KinWaiCheuk commented, Aug 4, 2021

I am facing the same problem. I am training wav2vec with my own dataset using the default wav2vec2_base.yaml configuration, but after some time the training accuracy drops to zero while the validation accuracy increases to 1.

The code perplexity and loss_0 curves do not look right either.

When I load the best checkpoint and inspect the features wav2vec extracts, the output is full of -inf:

model(x)['x']
output:
tensor([[[9.9857, 9.9857, 9.9857,  ..., 9.9857, 9.9857, 9.9857],
          [9.9857, 9.9857, 9.9857,  ..., 9.9857, 9.9857, 9.9857],
          [9.9857, 9.9857, 9.9857,  ..., 9.9857, 9.9857, 9.9857],
          [9.9857, 9.9857, 9.9857,  ..., 9.9857, 9.9857, 9.9857]],
 
         [[  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf]],
 
         [[  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf]],
 
         ...,
 
         [[  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf]],
 
         [[  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf]],
 
         [[  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf],
          [  -inf,   -inf,   -inf,  ...,   -inf,   -inf,   -inf]]],
        grad_fn=<CopySlices>)

When I extract the validation features, they are identical across all time steps:

model(x, mask=False, features_only=True)['x']
output:
tensor([[[ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         ...,
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728]],

        [[ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         ...,
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728]],

        [[ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         ...,
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728]],

        [[ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         ...,
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728],
         [ 0.5195,  1.2744,  1.0260,  ..., -0.6923,  0.0550, -0.5728]]],
       grad_fn=<TransposeBackward0>)
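
As an aside, a quick way to check for this kind of collapse (an illustration, not fairseq API beyond the forward call shown above; `model` and `x` are assumed to be a loaded wav2vec 2.0 model and a raw waveform batch) is to measure how much the extracted features vary across time:

    import torch

    # Assumed: `model` is a trained wav2vec 2.0 model and `x` is a
    # (batch, samples) float waveform tensor, as in the snippets above.
    with torch.no_grad():
        feats = model(x, mask=False, features_only=True)["x"]  # (batch, time, dim)

    # Collapsed representations are (near-)constant over time, so the
    # standard deviation along the time axis is ~0.
    time_std = feats.std(dim=1).mean().item()
    print(f"mean feature std across time: {time_std:.6f}")  # ~0 => collapse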

My environment:

  • fairseq Version (master):
  • PyTorch Version (1.9)
  • OS (e.g., Linux): ubuntu 18
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install --editable ./
  • Python version: 3.8.10
  • CUDA/cuDNN version: 11.2
  • GPU models and configuration: Tesla V100
  • Any other relevant information: I am training wav2vec on my own dataset, where the average audio length is around 7 minutes.

0 reactions
stale[bot] commented, Apr 17, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
