
Inference of finetuned wav2vec2.0 fails in forward method

See original GitHub issue

Before asking:

  1. search the issues.
  2. search the docs.

What is your question?

I couldn’t produce a valid inference from the fine-tuned wav2vec 2.0 models on the examples page. The forward method works fine on some audio with a loaded pretrained model, but fails with a loaded fine-tuned model.

Code

Loading of wav2vec2 is based on #3025. Dictionary used: https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt.

import torch
import fairseq

# wav2vec_path = 'wav2vec_small.pt' # pretrained
wav2vec_path = 'wav2vec_small_100h.pt' # finetuned

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
  [wav2vec_path],
  arg_overrides={"data": "/path/to/ltr/dict/"}  # directory containing dict.ltr.txt
)
model = model[0]
model.eval()

# device was defined elsewhere in the original notebook, e.g.:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# inference
example_wav = torch.randn([1, 10000])
with torch.no_grad():
  model(example_wav.to(device), mask=False) # fails
  # model(example_wav.to(device), mask=False, features_only=True) # fails too

which fails with the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-141-d5af471a3719> in <module>()
      1 example_wav = torch.randn([1, 10000])
      2 with torch.no_grad():
----> 3   model(example_wav.to(device), mask=False) # fails
      4   # model(example_wav.to(device), mask=False, features_only=True) # fails too
      5 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    725             result = self._slow_forward(*input, **kwargs)
    726         else:
--> 727             result = self.forward(*input, **kwargs)
    728         for hook in itertools.chain(
    729                 _global_forward_hooks.values(),

TypeError: forward() takes 1 positional argument but 2 were given
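
For what it’s worth, this traceback is consistent with the Wav2VecCtc signature in fairseq around this commit: its forward is declared as forward(self, **kwargs) and simply forwards the keyword arguments to w2v_encoder, whose forward expects source and padding_mask. Passing the waveform positionally therefore raises exactly this TypeError. A minimal sketch of a call matching that signature (hedged: padding_mask=None is an assumption for a single unpadded example, and the exact signature can differ between fairseq commits):

import torch

example_wav = torch.randn([1, 10000])
with torch.no_grad():
  # keyword arguments only; Wav2VecCtc passes them straight to w2v_encoder
  out = model(source=example_wav.to(device), padding_mask=None)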

What have you tried?

My guess is that the problem is in the head of the constructed model, where, because of the CTC task, final_proj is None. The model prints as follows (#2922 shows the same head structure, but with a different error):

Wav2VecCtc(
  (w2v_encoder): Wav2VecEncoder(
    (w2v_model): Wav2Vec2Model(
      (feature_extractor): ConvFeatureExtractionModel(
        (conv_layers): ModuleList(
          (0): Sequential(
            (0): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): Fp32GroupNorm(512, 512, eps=1e-05, affine=True)
            (3): GELU()
          )
          (1): Sequential(
            (0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): GELU()
          )
          (2): Sequential(
            (0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): GELU()
          )
          (3): Sequential(
            (0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): GELU()
          )
          (4): Sequential(
            (0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): GELU()
          )
          (5): Sequential(
            (0): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): GELU()
          )
          (6): Sequential(
            (0): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
            (1): Dropout(p=0.0, inplace=False)
            (2): GELU()
          )
        )
      )
      (post_extract_proj): Linear(in_features=512, out_features=768, bias=True)
      (dropout_input): Dropout(p=0.1, inplace=False)
      (dropout_features): Dropout(p=0.1, inplace=False)
      (quantizer): None
      (project_q): None
      (encoder): TransformerEncoder(
        (pos_conv): Sequential(
          (0): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
          (1): SamePad()
          (2): GELU()
        )
        (layers): ModuleList(
          (0): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (1): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (2): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (3): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (4): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (5): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (6): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (7): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (8): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (9): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (10): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
          (11): TransformerSentenceEncoderLayer(
            (self_attn): MultiheadAttention(
              (dropout_module): FairseqDropout()
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (dropout1): Dropout(p=0.1, inplace=False)
            (dropout2): Dropout(p=0.0, inplace=False)
            (dropout3): Dropout(p=0.1, inplace=False)
            (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          )
        )
        (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
      (final_proj): None
    )
    (final_dropout): Dropout(p=0.0, inplace=False)
    (proj): Linear(in_features=768, out_features=32, bias=True)
  )
)
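
Note that final_proj: None is expected for a fine-tuned checkpoint: final_proj belongs to the self-supervised pretraining objective (like quantizer and project_q), while the CTC output layer is the proj: Linear(in_features=768, out_features=32) inside Wav2VecEncoder, whose 32 outputs correspond to the entries of dict.ltr.txt plus fairseq’s special symbols. A rough greedy CTC decoding sketch on top of the keyword-argument call above (hedged: the "encoder_out" key, the time-first layout, and the choice of blank index are assumptions that may differ across fairseq versions):

import torch

with torch.no_grad():
  out = model(source=example_wav.to(device), padding_mask=None)

# assumption: "encoder_out" holds (time, batch, vocab) logits over the target dictionary
emissions = out["encoder_out"].transpose(0, 1)  # -> (batch, time, vocab)
pred_ids = emissions.argmax(dim=-1)[0].tolist()

# greedy CTC collapse: merge repeated indices, then drop the blank token
# (assumption: the blank is the dictionary's bos index, as in fairseq's CtcCriterion)
blank = task.target_dictionary.bos()
collapsed = [i for i, prev in zip(pred_ids, [None] + pred_ids[:-1]) if i != prev]
letters = [task.target_dictionary[i] for i in collapsed if i != blank]
text = "".join(letters).replace("|", " ")  # dict.ltr.txt uses "|" as the word separator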

What’s your environment?

  • fairseq master branch (commit 4a6f89d373dafc50b416092b41c070c304b31698)
  • PyTorch Version: 1.7.0+cu101
  • OS (e.g., Linux): Linux (google colab environment)
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source):
      git clone https://github.com/pytorch/fairseq
      cd fairseq
      pip install --editable ./
  • Python version: 3.6.9
  • Any other relevant information: -

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

1 reaction
debadityamandal commented, Mar 6, 2021

I am also getting the same error. Please let me know if you have any solution.

0 reactions
stale[bot] commented, May 2, 2022

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!

Read more comments on GitHub >

Top Results From Across the Web

  • Inference of finetuned wav2vec2.0 fails in forward method #3138: Couldn’t produce a valid inference of wav2vec2.0 finetuned models from examples page. Forward method for some audio works fine with loaded ...
  • Wav2Vec2 - Hugging Face: A blog post on how to finetune Wav2Vec2 for English ASR with Transformers. ... When used in normal mode, this method forwards all...
  • Fine-tune and deploy a Wav2Vec2 model for speech ...: This post shows how to use SageMaker to easily fine-tune the latest Wav2Vec2 model from Hugging Face, and then deploy the model with...
  • Wav2Vec2.0 on the Edge: Performance Evaluation - arXiv: The pretrained model is then fine tuned on small amounts of labeled data to use it for speech-to-text and machine translation tasks. Wav2Vec...
  • Shrinking Bigfoot: Reducing wav2vec 2.0 footprint - arXiv Vanity: Hence, the inference latency of wav2vec 2.0 will be a bottleneck in production, ... of masked pre-training and transformers is similar to wav2vec2.0....
