[Wav2Vec2] Community: How to run Wav2Vec2 for inference

The following gave me good results on the easy “clean” part of Librispeech. It might be helpful for others :-)

  1. Download a fine-tuned wav2vec acoustic model (AM):

$ wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small_960h.pt

  2. Download the dictionary:

$ wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt
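For orientation, fairseq dictionary files generally list one symbol and its count per line, and fairseq reserves the first four ids for the special symbols `<s>`, `<pad>`, `</s>` and `<unk>`. The sketch below shows how such a file could be turned into a token-to-id mapping. The helper name `build_vocab` and the sample text are hypothetical, and the counts are placeholders rather than the real values in dict.ltr.txt.

```python
# Hypothetical excerpt in fairseq's "<symbol> <count>" format.
# The counts are placeholders, NOT the real values from dict.ltr.txt.
SAMPLE_DICT_TEXT = """\
| 3
E 2
T 1
"""

# Special symbols that occupy the first four ids, in this order.
SPECIAL_SYMBOLS = ["<s>", "<pad>", "</s>", "<unk>"]


def build_vocab(dict_text):
    """Map each symbol to an integer id: specials first, then file order."""
    vocab = {sym: idx for idx, sym in enumerate(SPECIAL_SYMBOLS)}
    for line in dict_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split off the trailing count; only the symbol is needed here.
        symbol, _count = line.rsplit(" ", 1)
        vocab[symbol] = len(vocab)
    return vocab


vocab = build_vocab(SAMPLE_DICT_TEXT)
print(vocab)
```

To use this on the real file, the contents of dict.ltr.txt would be read into a string and passed to `build_vocab` in place of the sample text.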

  3. Load a sample of the Librispeech clean dataset for inference. Librispeech will soon be added to https://huggingface.co/datasets. In the meantime, I added a dummy dataset that contains a tiny portion of the dev-clean set for some quick experiments:

$ pip install datasets
$ pip install soundfile

from datasets import load_dataset
libri_dummy = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# check out the dataset
print(libri_dummy)

  4. Run a forward pass:

import torch
import fairseq

input_sample = torch.tensor(libri_dummy[0]["speech"])[None, :]

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['/path/to/wav2vec_small_960h.pt'], arg_overrides={"data": "/path/to/folder/of/dict"})
model = model[0]
model.eval()

logits = model(source=input_sample, padding_mask=None)["encoder_out"]

  5. Decode the prediction

The output is a tensor of shape [seq_len, 1, vocab_size]. We are interested in the most likely token for each time step. So we can take the argmax:

predicted_ids = torch.argmax(logits[:, 0], dim=-1)

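To make the argmax step concrete, here is a minimal sketch in plain Python with no batch dimension. The scores are made up, and the names `toy_logits` and `argmax` are illustrative only. Each row stands for one time step, and the result is one token id per row.

```python
# Made-up scores for 4 time steps over a 3-symbol vocabulary.
toy_logits = [
    [0.1, 2.0, 0.3],
    [0.2, 1.5, 0.1],
    [3.0, 0.2, 0.4],
    [0.0, 0.1, 0.9],
]


def argmax(row):
    """Return the index of the largest value in one row."""
    return max(range(len(row)), key=row.__getitem__)


# One predicted id per time step, analogous to argmax over the last dimension.
toy_predicted_ids = [argmax(frame) for frame in toy_logits]
print(toy_predicted_ids)  # [1, 1, 0, 2]
```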
  6. Now we’ll create our own decoder, based on the dict we downloaded previously, to decode the result:

json_dict = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3, "|": 4, "E": 5, "T": 6, "A": 7, "O": 8, "N": 9, "I": 10, "H": 11, "S": 12, "R": 13, "D": 14, "L": 15, "U": 16, "M": 17, "W": 18, "C": 19, "F": 20, "G": 21, "Y": 22, "P": 23, "B": 24, "V": 25, "K": 26, "'": 27, "X": 28, "J": 29, "Q": 30, "Z": 31}

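Then create a decoder:

import numpy as np
from itertools import groupby

class Decoder:
    def __init__(self, json_dict):
        self.dict = json_dict
        # Position i of look_up holds the token whose id is i
        # (relies on json_dict listing tokens in ascending id order).
        self.look_up = np.asarray(list(self.dict.keys()))

    def decode(self, ids):
        # Map each predicted id to its token string.
        converted_tokens = self.look_up[np.asarray(ids)]
        # Collapse runs of identical consecutive tokens.
        fused_tokens = [tok[0] for tok in groupby(converted_tokens)]
        # Remove the "<s>" tokens and turn the word delimiter "|" into spaces.
        output = ' '.join(''.join(''.join(fused_tokens).split("<s>")).split("|"))
        return output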
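The order of operations in `decode` matters: repeated tokens are merged first, and only then is `<s>` removed. As far as I can tell, `<s>` acts as the blank between frames here, so a double letter survives only when an `<s>` separates the two runs. Below is a small walk-through with hypothetical frame-level ids for the word "HELLO". The names `toy_vocab` and `toy_ids` are illustrative, and `str.replace` is used in place of the split/join chain, which should give the same result.

```python
from itertools import groupby

# Subset of the mapping used in this post, enough to spell one word.
toy_vocab = {"<s>": 0, "|": 4, "E": 5, "O": 8, "H": 11, "L": 15}
id_to_token = {idx: tok for tok, idx in toy_vocab.items()}

# Hypothetical per-frame predictions: H H <s> E L L <s> L O O |
toy_ids = [11, 11, 0, 5, 15, 15, 0, 15, 8, 8, 4]

tokens = [id_to_token[i] for i in toy_ids]
# Step 1: merge runs of identical consecutive tokens.
fused = [tok for tok, _ in groupby(tokens)]
# Step 2: drop "<s>"; step 3: turn the word delimiter "|" into a space.
text = "".join(fused).replace("<s>", "").replace("|", " ")
print(repr(text))  # 'HELLO '
```

Without the `<s>` between the two runs of "L", step 1 would merge them into a single "L" and the output would read "HELO ".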

Now we can decode the output and compare it to the correct output:

decoder = Decoder(json_dict=json_dict)
print("Prediction: ", decoder.decode(predicted_ids))

This should give 'A MAN SAID TO THE UNIVERSE SIR I EXIST ', which matches the correct output when compared to:

print(libri_dummy[0]["text"])
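Comparing by eye works for one sample, but exact string equality can be thrown off by the trailing space in the prediction. One common way to compare transcripts is the word error rate. The function below is a minimal sketch and not part of the original recipe: `word_error_rate` is a hypothetical helper, and the two strings are copied from the example above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


reference = "A MAN SAID TO THE UNIVERSE SIR I EXIST"
prediction = "A MAN SAID TO THE UNIVERSE SIR I EXIST "
print(word_error_rate(reference, prediction))  # 0.0
```

Because both strings are split on whitespace, the trailing space does not count as an error.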

Wav2Vec2 will soon be available in 🤗 Transformers 😃

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 11
  • Comments: 7

Top GitHub Comments

raja1196 commented, Jan 26, 2021 (2 reactions)

Thank you for focusing your time on this. I have tested a workaround wav2vec-docker, but this will help a lot more people.

stale[bot] commented, May 1, 2022 (0 reactions)

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
