
Generating own speaker embedding with ECAPA and OOM when running speaker_verification_cosine.py

See original GitHub issue

Hi, I’ve trained a model on my own dataset with 2 GPUs on the same machine, following here. But I am confused about how to generate speaker embeddings if I want to use my own model and checkpoint on a custom audio file, like this:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the pretrained ECAPA-TDNN speaker encoder from HuggingFace
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, fs = torchaudio.load('samples/audio_samples/example1.wav')
embeddings = classifier.encode_batch(signal)
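For reference, here is a sketch of what I imagined for loading my own checkpoint through the same interface; the local paths and the inference hyperparams.yaml below are placeholders I would have to create myself, so I am not sure this is the intended way:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Placeholder paths: "source" would point at my training output folder,
# and hyperparams.yaml is an inference yaml I would have to write myself.
classifier = EncoderClassifier.from_hparams(
    source="results/ecapa_augment/1986/save",
    hparams_file="hyperparams.yaml",
    savedir="pretrained_ecapa",
)
signal, fs = torchaudio.load('test.wav')
embeddings = classifier.encode_batch(signal)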

Or is there a suitable way to preprocess my input file if I write something like this (I copied the function compute_embedding from speaker_verification_cosine.py)?

import os
import torch
import torchaudio
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main

def compute_embedding(wavs, wav_lens):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : Torch.Tensor
        Tensor containing the speech waveform (batch, time).
        Make sure the sample rate is fs=16000 Hz.
    wav_lens : Torch.Tensor
        Tensor containing the relative length of each sentence
        in the batch (e.g., [0.8, 0.6, 1.0]).
    """
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        embeddings = params["mean_var_norm_emb"](
            embeddings, torch.ones(embeddings.shape[0]).to(embeddings.device)
        )
    return embeddings.squeeze(1)

# Parse the recipe hyperparameters the same way speaker_verification_cosine.py does
arg = ['hparams/verification_ecapa.yaml', '--data_folder=/disk/data/lrs3/lrs3_wav/']
params_file, run_opts, overrides = sb.core.parse_arguments(arg[:])
with open(params_file) as fin:
    params = load_hyperpyyaml(fin, overrides)

# Fetch the checkpoint files and load them into the embedding model
run_on_main(params["pretrainer"].collect_files)
params["pretrainer"].load_collected(params["device"])
params["embedding_model"].eval()
params["embedding_model"].to(params["device"])

wavs, fs = torchaudio.load('test.wav')
lens = torch.tensor([1.0]).to(params["device"])
emb = compute_embedding(wavs, lens).unsqueeze(1)

I am not sure what format wavs and lens should be in (my current guess is sketched below). Or is there a simpler way to do this?
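For what it’s worth, this is my current, unverified understanding of the expected input format:

import torch
import torchaudio

# My unverified understanding: wavs should be a float tensor of shape
# (batch, time) sampled at 16 kHz, and lens relative lengths in (0, 1].
wav, fs = torchaudio.load('test.wav')              # (channels, time)
wav = wav.mean(dim=0, keepdim=True)                # downmix to mono if needed
if fs != 16000:
    wav = torchaudio.functional.resample(wav, fs, 16000)
wavs = wav                                         # (1, time) == (batch, time)
lens = torch.ones(wavs.shape[0])                   # single utterance -> 1.0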

Also, I cannot compute my EER and minDCF when running speaker_verification_cosine.py, even when I launch it like this:

CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 speaker_verification_cosine.py hparams/verification_ecapa.yaml --data_folder=/disk/data/wav/ --distributed_launch --distributed_backend='nccl' --data_parallel_backend

GPU 0 always runs out of memory, while GPU 1 seems to do nothing. Does anyone have any idea?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

2 reactions
underdogliu commented, May 18, 2022

Hi @anitaweng, here is the function I used to extract Kaldi-based embeddings:

import numpy as np
import torch
import torchaudio

# Compute embeddings from the waveforms
def compute_embeddings_single(wavs, wav_lens, params):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : Torch.Tensor
        Tensor containing the speech waveform (batch, time).
        Make sure the sample rate is fs=16000 Hz.
    wav_lens : Torch.Tensor
        Tensor containing the relative length of each sentence
        in the batch (e.g., [0.8, 0.6, 1.0]).
    """
    wavs = wavs.to(params["device"])
    wav_lens = wav_lens.to(params["device"])
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens).to(params["device"])
        embeddings = params["embedding_model"](feats)
        embeddings = params["mean_var_norm_emb"](
            embeddings, torch.ones(embeddings.shape[0]).to(embeddings.device)
        )
    return embeddings.squeeze(1)


def compute_embeddings(params, wav_scp, outdir, ark):
    # wav_scp is a Kaldi-style file with one "utt-id wav-path" pair per line
    with torch.no_grad():
        with open(wav_scp, "r") as wavscp:
            for line in wavscp:
                utt, wav_path = line.split()
                out_file = "{}/npys/{}.npy".format(outdir, utt)
                wav, _ = torchaudio.load(wav_path)
                # Reshape (channels, time) into a (1, time) batch
                data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
                embedding = compute_embeddings_single(
                    data, torch.Tensor([data.shape[0]]), params
                ).squeeze()

                # Save as .npy and also append to a Kaldi ark
                out_embedding = embedding.detach().cpu().numpy()
                np.save(out_file, out_embedding)
                write_vecs_to_kaldi(out_embedding, utt, ark)
                del out_embedding, wav, data

Please ignore the surrounding dependencies (e.g., write_vecs_to_kaldi); as you can see from the caller function, these lines in particular do the work:

data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
embedding = compute_embeddings_single(data, torch.Tensor([data.shape[0]]), params).squeeze()

I set the length equal to the input wav length, which means we do not perform any padding or chunking as long as the input wav is long enough. It works for me; maybe you want to give it a try.
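In case it helps, here is roughly how I call it. The paths are placeholders from my setup, and I open the ark as a writable binary handle, but that depends on what your Kaldi writer (write_vecs_to_kaldi in my case) expects:

import os

# Placeholder paths; wav.scp is a Kaldi-style "utt-id wav-path" list
outdir = "exp/embeddings"
os.makedirs(os.path.join(outdir, "npys"), exist_ok=True)
with open(os.path.join(outdir, "xvector.ark"), "wb") as ark:
    compute_embeddings(params, "data/test/wav.scp", outdir, ark)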

2 reactions
anautsch commented, May 18, 2022

Hi @anitaweng, just to quickly respond to your question on wavs and lens: please take a look at how it’s handled in the interface of the pretrained speaker recognition model. SpeechBrain provides a load_audio(path_x) function. Lengths are explained here: if both files have equal duration, they both get a 1.0; otherwise the longer one is at 1.0 and the other gets a relative 0.xyz… factor. A small sketch follows below.
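To make the lengths concrete, a small sketch (the file names are placeholders; the longer file gets 1.0 and the shorter one a fraction of it):

import torch
import torch.nn.functional as F
import torchaudio

# Placeholder file names; suppose enrol.wav is longer than test.wav
wav_a, _ = torchaudio.load("enrol.wav")
wav_b, _ = torchaudio.load("test.wav")
n_a, n_b = wav_a.shape[1], wav_b.shape[1]
max_len = max(n_a, n_b)

# Zero-pad the shorter signal so both fit into one (batch, time) tensor
batch = torch.stack([
    F.pad(wav_a.squeeze(0), (0, max_len - n_a)),
    F.pad(wav_b.squeeze(0), (0, max_len - n_b)),
])
lens = torch.tensor([n_a / max_len, n_b / max_len])  # e.g. [1.0, 0.83]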

Do you have the same issue with single-GPU training? (That would help to sort out potential error sources.)
