Inference of finetuned wav2vec2.0 fails in forward method
Before asking:
- search the issues.
- search the docs.
What is your question?
I couldn’t produce valid inference from the finetuned wav2vec 2.0 models on the examples page. The forward method works fine on some audio with a loaded pretrained model, but fails with a loaded finetuned model.
Code
Loading of wav2vec2 is inspired by #3025. Dict used: https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt.
import torch
import fairseq

device = torch.device("cpu")  # device was implicit in the original snippet

# wav2vec_path = 'wav2vec_small.pt'  # pretrained
wav2vec_path = 'wav2vec_small_100h.pt'  # finetuned
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    [wav2vec_path],
    arg_overrides={"data": "/path/to/ltr/dict/"},
)
model = model[0]
model.eval()

# inference
example_wav = torch.randn([1, 10000])
with torch.no_grad():
    model(example_wav.to(device), mask=False)  # fails
    # model(example_wav.to(device), mask=False, features_only=True)  # fails too
This fails with the following error:
TypeError Traceback (most recent call last)
<ipython-input-141-d5af471a3719> in <module>()
1 example_wav = torch.randn([1, 10000])
2 with torch.no_grad():
----> 3 model(example_wav.to(device), mask=False) # fails
4 # model(example_wav.to(device), mask=False, features_only=True) # fails too
5
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
725 result = self._slow_forward(*input, **kwargs)
726 else:
--> 727 result = self.forward(*input, **kwargs)
728 for hook in itertools.chain(
729 _global_forward_hooks.values(),
TypeError: forward() takes 1 positional argument but 2 were given
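The traceback points at the call signature rather than the checkpoint: unlike Wav2Vec2Model.forward, which takes source positionally, Wav2VecCtc.forward appears to accept keyword arguments only and forwards them to its w2v_encoder. A minimal sketch of a keyword-only call that should sidestep this particular TypeError (the argument names come from my reading of fairseq's wav2vec2_asr.py and may differ between commits):

# A minimal sketch, assuming Wav2VecCtc.forward(**kwargs) routes keyword
# arguments through to Wav2VecEncoder.forward(source, padding_mask, ...).
with torch.no_grad():
    # padding_mask=None since the batch contains a single, unpadded waveform
    net_output = model(source=example_wav.to(device), padding_mask=None)

# net_output should be a dict whose "encoder_out" entry holds the CTC
# logits, shaped (time, batch, vocab) in this fairseq version if I read
# wav2vec2_asr.py correctly.
logits = net_output["encoder_out"]
print(logits.shape)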
What have you tried?
I guess the problem is in the head of the constructed model, where, due to the CTC task, final_proj is None. The model has the following print output (#2922 showed the same structure in the head, but with a different error):
Wav2VecCtc(
(w2v_encoder): Wav2VecEncoder(
(w2v_model): Wav2Vec2Model(
(feature_extractor): ConvFeatureExtractionModel(
(conv_layers): ModuleList(
(0): Sequential(
(0): Conv1d(1, 512, kernel_size=(10,), stride=(5,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): Fp32GroupNorm(512, 512, eps=1e-05, affine=True)
(3): GELU()
)
(1): Sequential(
(0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU()
)
(2): Sequential(
(0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU()
)
(3): Sequential(
(0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU()
)
(4): Sequential(
(0): Conv1d(512, 512, kernel_size=(3,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU()
)
(5): Sequential(
(0): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU()
)
(6): Sequential(
(0): Conv1d(512, 512, kernel_size=(2,), stride=(2,), bias=False)
(1): Dropout(p=0.0, inplace=False)
(2): GELU()
)
)
)
(post_extract_proj): Linear(in_features=512, out_features=768, bias=True)
(dropout_input): Dropout(p=0.1, inplace=False)
(dropout_features): Dropout(p=0.1, inplace=False)
(quantizer): None
(project_q): None
(encoder): TransformerEncoder(
(pos_conv): Sequential(
(0): Conv1d(768, 768, kernel_size=(128,), stride=(1,), padding=(64,), groups=16)
(1): SamePad()
(2): GELU()
)
(layers): ModuleList(
(0): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(1): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(2): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(3): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(4): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(5): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(6): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(7): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(8): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(9): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(10): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(11): TransformerSentenceEncoderLayer(
(self_attn): MultiheadAttention(
(dropout_module): FairseqDropout()
(k_proj): Linear(in_features=768, out_features=768, bias=True)
(v_proj): Linear(in_features=768, out_features=768, bias=True)
(q_proj): Linear(in_features=768, out_features=768, bias=True)
(out_proj): Linear(in_features=768, out_features=768, bias=True)
)
(dropout1): Dropout(p=0.1, inplace=False)
(dropout2): Dropout(p=0.0, inplace=False)
(dropout3): Dropout(p=0.1, inplace=False)
(self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(fc1): Linear(in_features=768, out_features=3072, bias=True)
(fc2): Linear(in_features=3072, out_features=768, bias=True)
(final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
)
(layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)
(layer_norm): LayerNorm((512,), eps=1e-05, elementwise_affine=True)
(final_proj): None
)
(final_dropout): Dropout(p=0.0, inplace=False)
(proj): Linear(in_features=768, out_features=32, bias=True)
)
)
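For what it’s worth, final_proj: None (like quantizer: None) is probably expected rather than broken: those modules serve the contrastive pretraining objective and are dropped for CTC finetuning, while the actual CTC head is the (proj): Linear(in_features=768, out_features=32) layer at the bottom, whose 32 outputs should correspond to the entries of dict.ltr.txt plus the special symbols. Assuming the keyword-argument call sketched above works, here is a hedged sketch of greedy CTC decoding on top of its logits (the blank index of 0, the helper name, and the task.target_dictionary attribute are assumptions on my part, not confirmed fairseq API):

# Hedged sketch: greedy CTC decode of the logits produced above.
# Assumes logits has shape (time, batch, vocab) and that index 0 is the
# CTC blank; both are assumptions, not guarantees.
def greedy_ctc_decode(logits, dictionary, blank=0):
    # most probable token at each frame, for the first utterance in the batch
    pred = logits.argmax(dim=-1)[:, 0].tolist()
    tokens, prev = [], None
    for p in pred:
        if p != blank and p != prev:  # collapse repeats, then drop blanks
            tokens.append(p)
        prev = p
    # dict.ltr.txt uses "|" as the word separator
    text = dictionary.string(torch.tensor(tokens))
    return text.replace(" ", "").replace("|", " ")

print(greedy_ctc_decode(logits, task.target_dictionary))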
What’s your environment?
- fairseq master branch (commit 4a6f89d373dafc50b416092b41c070c304b31698)
- PyTorch Version: 1.7.0+cu101
- OS (e.g., Linux): Linux (google colab environment)
- How you installed fairseq (pip, source): source
- Build command you used (if compiling from source):
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
- Python version: 3.6.9
- Any other relevant information: -
Top GitHub Comments
I am also getting the same error. Please let me know if you have any solution.
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!