
Generating own speaker embedding with ECAPA and OOM when running speaker_verification_cosine.py

See original GitHub issue

Hi, I’ve trained a model on my own dataset with 2 GPUs on the same machine, following here. But I am confused about how to generate speaker embeddings if I want to use my own model and checkpoint on a custom audio file, like this:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load the pretrained ECAPA-TDNN speaker encoder from HuggingFace
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, fs = torchaudio.load('samples/audio_samples/example1.wav')
embeddings = classifier.encode_batch(signal)
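For reference, here is a sketch of what I imagined for loading my own checkpoint through the same interface; the local paths and the inference hyperparams.yaml below are placeholders I would have to create myself, so I am not sure this is the intended way:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Placeholder paths: "source" would point at my training output folder,
# and hyperparams.yaml is an inference yaml I would have to write myself.
classifier = EncoderClassifier.from_hparams(
    source="results/ecapa_augment/1986/save",
    hparams_file="hyperparams.yaml",
    savedir="pretrained_ecapa",
)
signal, fs = torchaudio.load('test.wav')
embeddings = classifier.encode_batch(signal)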

Or is there a suitable way to preprocess my input file if I write something like this (I copied the function compute_embedding from speaker_verification_cosine.py)?

import os
import torch
import torchaudio
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main

def compute_embedding(wavs, wav_lens):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : Torch.Tensor
        Tensor containing the speech waveform (batch, time).
        Make sure the sample rate is fs=16000 Hz.
    wav_lens : Torch.Tensor
        Tensor containing the relative length of each sentence
        in the batch (e.g., [0.8, 0.6, 1.0]).
    """
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        embeddings = params["mean_var_norm_emb"](
            embeddings, torch.ones(embeddings.shape[0]).to(embeddings.device)
        )
    return embeddings.squeeze(1)

# Parse the recipe hyperparameters the same way speaker_verification_cosine.py does
arg = ['hparams/verification_ecapa.yaml', '--data_folder=/disk/data/lrs3/lrs3_wav/']
params_file, run_opts, overrides = sb.core.parse_arguments(arg[:])
with open(params_file) as fin:
    params = load_hyperpyyaml(fin, overrides)

# Fetch the checkpoint files and load them into the embedding model
run_on_main(params["pretrainer"].collect_files)
params["pretrainer"].load_collected(params["device"])
params["embedding_model"].eval()
params["embedding_model"].to(params["device"])

wavs, fs = torchaudio.load('test.wav')
lens = torch.tensor([1.0]).to(params["device"])
emb = compute_embedding(wavs, lens).unsqueeze(1)

I am not sure what format wavs and lens should be in (my current guess is sketched below). Or is there a simpler way to do this?
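For what it’s worth, this is my current, unverified understanding of the expected input format:

import torch
import torchaudio

# My unverified understanding: wavs should be a float tensor of shape
# (batch, time) sampled at 16 kHz, and lens relative lengths in (0, 1].
wav, fs = torchaudio.load('test.wav')              # (channels, time)
wav = wav.mean(dim=0, keepdim=True)                # downmix to mono if needed
if fs != 16000:
    wav = torchaudio.functional.resample(wav, fs, 16000)
wavs = wav                                         # (1, time) == (batch, time)
lens = torch.ones(wavs.shape[0])                   # single utterance -> 1.0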

Also, I cannot compute my EER and minDCF when running speaker_verification_cosine.py, even when I launch it like this:

CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 speaker_verification_cosine.py hparams/verification_ecapa.yaml --data_folder=/disk/data/wav/ --distributed_launch --distributed_backend='nccl' --data_parallel_backend

GPU 0 always runs out of memory, while GPU 1 seems to do nothing. Does anyone have any idea?

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

2 reactions
underdogliu commented, May 18, 2022

Hi @anitaweng, here is the function I used to extract Kaldi-based embeddings:

import numpy as np
import torch
import torchaudio

# Compute embeddings from the waveforms
def compute_embeddings_single(wavs, wav_lens, params):
    """Compute speaker embeddings.

    Arguments
    ---------
    wavs : Torch.Tensor
        Tensor containing the speech waveform (batch, time).
        Make sure the sample rate is fs=16000 Hz.
    wav_lens : Torch.Tensor
        Tensor containing the relative length of each sentence
        in the batch (e.g., [0.8, 0.6, 1.0]).
    """
    wavs = wavs.to(params["device"])
    wav_lens = wav_lens.to(params["device"])
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens).to(params["device"])
        embeddings = params["embedding_model"](feats)
        embeddings = params["mean_var_norm_emb"](
            embeddings, torch.ones(embeddings.shape[0]).to(embeddings.device)
        )
    return embeddings.squeeze(1)


def compute_embeddings(params, wav_scp, outdir, ark):
    # wav_scp is a Kaldi-style file with one "utt-id wav-path" pair per line
    with torch.no_grad():
        with open(wav_scp, "r") as wavscp:
            for line in wavscp:
                utt, wav_path = line.split()
                out_file = "{}/npys/{}.npy".format(outdir, utt)
                wav, _ = torchaudio.load(wav_path)
                # Reshape (channels, time) into a (1, time) batch
                data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
                embedding = compute_embeddings_single(
                    data, torch.Tensor([data.shape[0]]), params
                ).squeeze()

                # Save as .npy and also append to a Kaldi ark
                out_embedding = embedding.detach().cpu().numpy()
                np.save(out_file, out_embedding)
                write_vecs_to_kaldi(out_embedding, utt, ark)
                del out_embedding, wav, data

Please ignore the surrounding dependencies (e.g., write_vecs_to_kaldi); as you can see from the caller function, these lines in particular do the work:

data = wav.transpose(0, 1).squeeze(1).unsqueeze(0)
embedding = compute_embeddings_single(data, torch.Tensor([data.shape[0]]), params).squeeze()

I set the length equal to the input wav length, which means we do not perform any padding or chunking as long as the input wav is long enough. It works for me; maybe you want to give it a try.
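In case it helps, here is roughly how I call it. The paths are placeholders from my setup, and I open the ark as a writable binary handle, but that depends on what your Kaldi writer (write_vecs_to_kaldi in my case) expects:

import os

# Placeholder paths; wav.scp is a Kaldi-style "utt-id wav-path" list
outdir = "exp/embeddings"
os.makedirs(os.path.join(outdir, "npys"), exist_ok=True)
with open(os.path.join(outdir, "xvector.ark"), "wb") as ark:
    compute_embeddings(params, "data/test/wav.scp", outdir, ark)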

2 reactions
anautsch commented, May 18, 2022

Hi @anitaweng, just to quickly respond to your question on wavs and lens: please take a look at how it’s handled in the interface of the pretrained speaker recognition model. SpeechBrain provides a load_audio(path_x) function. Lengths are explained here: if both files have equal duration, they both get a 1.0; otherwise the longer one is at 1.0 and the other gets a relative 0.xyz… factor. A small sketch follows below.
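To make the lengths concrete, a small sketch (the file names are placeholders; the longer file gets 1.0 and the shorter one a fraction of it):

import torch
import torch.nn.functional as F
import torchaudio

# Placeholder file names; suppose enrol.wav is longer than test.wav
wav_a, _ = torchaudio.load("enrol.wav")
wav_b, _ = torchaudio.load("test.wav")
n_a, n_b = wav_a.shape[1], wav_b.shape[1]
max_len = max(n_a, n_b)

# Zero-pad the shorter signal so both fit into one (batch, time) tensor
batch = torch.stack([
    F.pad(wav_a.squeeze(0), (0, max_len - n_a)),
    F.pad(wav_b.squeeze(0), (0, max_len - n_b)),
])
lens = torch.tensor([n_a / max_len, n_b / max_len])  # e.g. [1.0, 0.83]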

Do you have the same issue with single-GPU training? (That would help to sort out potential error sources.)
