Generating own speaker embedding with ECAPA and OOM when running speaker_verification_cosine.py
Hi, I’ve trained on my own dataset with 2 GPUs on the same device, following here. But I am confused about how to generate speaker embeddings if I want to use my own model and checkpoint on a custom audio file, like this:
import torchaudio
from speechbrain.pretrained import EncoderClassifier
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
signal, fs = torchaudio.load('samples/audio_samples/example1.wav')
embeddings = classifier.encode_batch(signal)
Or is there a suitable way to preprocess my input file if I write something like this (I copied the function compute_embedding from speaker_verification_cosine.py):
import os
import torch
import torchaudio
import speechbrain as sb
from hyperpyyaml import load_hyperpyyaml
from speechbrain.utils.distributed import run_on_main
def compute_embedding(wavs, wav_lens):
    """Compute speaker embeddings.
    Arguments
    ---------
    wavs : Torch.Tensor
        Tensor containing the speech waveform (batch, time).
        Make sure the sample rate is fs=16000 Hz.
    wav_lens: Torch.Tensor
        Tensor containing the relative length for each sentence
        in the length (e.g., [0.8 0.6 1.0])
    """
    with torch.no_grad():
        feats = params["compute_features"](wavs)
        feats = params["mean_var_norm"](feats, wav_lens)
        embeddings = params["embedding_model"](feats, wav_lens)
        embeddings = params["mean_var_norm_emb"](
            embeddings, torch.ones(embeddings.shape[0]).to(embeddings.device)
        )
    return embeddings.squeeze(1)
arg = ['hparams/verification_ecapa.yaml', '--data_folder=/disk/data/lrs3/lrs3_wav/']
params_file, run_opts, overrides = sb.core.parse_arguments(arg[:])
with open(params_file) as fin:
    params = load_hyperpyyaml(fin, overrides)
run_on_main(params["pretrainer"].collect_files)
params["pretrainer"].load_collected(params["device"])
params["embedding_model"].eval()
params["embedding_model"].to(params["device"])
wavs, fs = torchaudio.load('test.wav')
lens = torch.tensor([1.0]).to(params["device"])
emb = compute_embedding(wavs, lens).unsqueeze(1)
I am not sure what format wavs and lens should be in.
Or is there a simpler way to do this?
Also, I cannot test my EER and minDCF when running speaker_verification_cosine.py, even if I set the command line like this:
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 speaker_verification_cosine.py hparams/verification_ecapa.yaml --data_folder=/disk/data/wav/ --distributed_launch --distributed_backend='nccl' --data_parallel_backend
GPU 0 always runs out of memory, and GPU 1 seems to do nothing. Does anyone have any idea?
Issue Analytics
- Created a year ago
- Comments:7 (1 by maintainers)
Top GitHub Comments
Hi @anitaweng here is the function I used to extract Kaldi-based embeddings:
Please ignore related dependencies, but as you can see from the caller function, these lines in particular:
I set the length equal to the input wav length. This means we do not perform any kind of padding/chunking if the length of input wav suffices. It works for me. Maybe you wanna have a try.
Hi @anitaweng just to quickly respond to your question on wavs and lens. Please take a look at how it’s handled in the interface for the pretrained speaker recognition model. SpeechBrain provides a load_audio(path_x) function. Lengths are explained here: if both files are of equal duration, they both get a ‘1.0’; otherwise the longer one is at 1.0 and the other gets a relative 0.xyz… factor.
Do you have the same issue with single GPU training? (to sort out potential error sources)
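The relative-length convention described in the comment above can be illustrated with a small sketch. It uses plain PyTorch only and is not SpeechBrain's own implementation. The helper names to_mono and make_batch are hypothetical, downmixing by averaging channels is an assumption, and random tensors stand in for real recordings already sampled at 16 kHz.

```python
import torch


def to_mono(signal: torch.Tensor) -> torch.Tensor:
    """Collapse a (channels, time) waveform to a 1-D (time,) tensor.

    Hypothetical helper: averaging the channels is an assumption,
    not something the thread prescribes.
    """
    return signal.mean(dim=0)


def make_batch(waveforms):
    """Zero-pad 1-D waveforms to a common length.

    Returns
    -------
    wavs : torch.Tensor of shape (batch, time)
    wav_lens : torch.Tensor of shape (batch,), each entry being that
        waveform's length divided by the longest length in the batch.
    """
    num_samples = torch.tensor(
        [w.shape[0] for w in waveforms], dtype=torch.float32
    )
    wavs = torch.nn.utils.rnn.pad_sequence(waveforms, batch_first=True)
    wav_lens = num_samples / num_samples.max()
    return wavs, wav_lens


# Stand-ins for two recordings: 2 s of stereo and 1 s of mono at 16 kHz.
long_clip = to_mono(torch.randn(2, 32000))
short_clip = to_mono(torch.randn(1, 16000))

wavs, wav_lens = make_batch([long_clip, short_clip])
print(wavs.shape)  # torch.Size([2, 32000])
print(wav_lens)    # the longer clip gets 1.0, the shorter one 0.5
```

With a single file the same helper returns wavs of shape (1, time) and wav_lens equal to [1.0], which matches the lens tensor used in the question. Both tensors would still need to be moved to the same device as the model before calling compute_embedding.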