EER is very high in speaker verification (using my own model and default pretrain model)


Hello everyone,

I am currently trying to build a model for speaker verification using my own data. To do so, I duplicated the VoxCeleb recipe and created my own CSV files.

Training the x-vector model

I managed to train the model quite smoothly with python train_speaker_embeddings.py hparams/train_x_vectors.yaml, using the following configuration:

# ################################
# Model: Speaker identification with ECAPA
# Authors: Hwidong Na & Mirco Ravanelli
# ################################

# Basic parameters
seed: 1989
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/xvectors/<seed>
save_folder: !ref <output_folder>/save
train_log: log/train.log

# Data files
data_folder: ../../../data/speechbrain
train_annotation: !ref <data_folder>/train.csv
valid_annotation: !ref <data_folder>/valid.csv

# Folder to extract data augmentation files
rir_folder: !ref <data_folder> # Change it if needed

# Use the following links for the official voxceleb splits:
verification_file: !ref <data_folder>/valid_verif.txt

skip_prep: False
ckpt_interval_minutes: 15 # save checkpoint every N min

# Training parameters
number_of_epochs: 20
batch_size: 16
lr: 0.001
lr_final: 0.0001

sample_rate: 16000
sentence_len: 3.0 # seconds
shuffle: True
random_chunk: False

# Feature parameters
n_mels: 24
left_frames: 0
right_frames: 0
deltas: False

# Number of speakers
out_n_neurons: 1621

dataloader_options:
    batch_size: !ref <batch_size>
    shuffle: !ref <shuffle>
    num_workers: 0

# Functions
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>
    left_frames: !ref <left_frames>
    right_frames: !ref <right_frames>
    deltas: !ref <deltas>

embedding_model: !new:speechbrain.lobes.models.Xvector.Xvector
    in_channels: !ref <n_mels>
    activation: !name:torch.nn.LeakyReLU
    tdnn_blocks: 5
    tdnn_channels: [512, 512, 512, 512, 1500]
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    lin_neurons: 512

classifier: !new:speechbrain.lobes.models.Xvector.Classifier
    input_shape: [null, null, 512]
    activation: !name:torch.nn.LeakyReLU
    lin_blocks: 1
    lin_neurons: 512
    out_neurons: !ref <out_n_neurons>

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>


augment_wavedrop: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [100]

augment_speed: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]

add_rev: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <rir_folder>
    openrir_max_noise_len: 3.0  # seconds
    reverb_prob: 1.0
    noise_prob: 0.0
    noise_snr_low: 0
    noise_snr_high: 15
    rir_scale_factor: 1.0

add_noise: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <rir_folder>
    openrir_max_noise_len: 3.0  # seconds
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
    rir_scale_factor: 1.0

add_rev_noise: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <rir_folder>
    openrir_max_noise_len: 3.0  # seconds
    reverb_prob: 1.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
    rir_scale_factor: 1.0


# Definition of the augmentation pipeline.
# If concat_augment = False, the augmentation techniques are applied
# in sequence. If concat_augment = True, all the augmented signals
# are concatenated in a single big batch.
augment_pipeline: [
    !ref <augment_wavedrop>,
    !ref <augment_speed>,
    !ref <add_rev>,
    !ref <add_noise>,
    !ref <add_rev_noise>
]
concat_augment: True

mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False

modules:
    compute_features: !ref <compute_features>
    augment_wavedrop: !ref <augment_wavedrop>
    augment_speed: !ref <augment_speed>
    add_rev: !ref <add_rev>
    add_noise: !ref <add_noise>
    add_rev_noise: !ref <add_rev_noise>
    embedding_model: !ref <embedding_model>
    classifier: !ref <classifier>
    mean_var_norm: !ref <mean_var_norm>

# Cost + optimization
compute_cost: !name:speechbrain.nnet.losses.nll_loss
compute_error: !name:speechbrain.nnet.losses.classification_error

opt_class: !name:torch.optim.Adam
    lr: !ref <lr>
    weight_decay: 0.000002

lr_annealing: !new:speechbrain.nnet.schedulers.LinearScheduler
    initial_value: !ref <lr>
    final_value: !ref <lr_final>
    epoch_count: !ref <number_of_epochs>

# Logging + checkpoints
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

error_stats: !name:speechbrain.utils.metric_stats.MetricStats
    metric: !name:speechbrain.nnet.losses.classification_error
        reduction: batch

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        embedding_model: !ref <embedding_model>
        classifier: !ref <classifier>
        normalizer: !ref <mean_var_norm>
        counter: !ref <epoch_counter>

Question: The ErrorRate after the last epoch is around 0.8 (in the <output_folder>/log.txt file). Does it mean 0.8% or 80%? Is it very good or very bad?

Speaker verification with PLDA

Then, I performed speaker verification using another set of data, similar to the training and validation sets I used during the x-vector model training. To do so, I used the command python speaker_verification_plda.py hparams/verification_plda_xvector.yaml with the following configuration:

# ################################
# Model: Speaker Verification Baseline using PLDA
# Authors: Nauman Dawalatabad & Mirco Ravanelli 2020
# ################################

seed: 1989
__set_seed: !apply:torch.manual_seed [!ref <seed>]

# Folders and train_log file
data_folder: ../../../data/speechbrain  # use vox 1, vox2, or vox1+vox2 datasets
output_folder: !ref results/xvectors/<seed>
save_folder: !ref <output_folder>/save
device: 'cuda:0'

# Use the following links for the official voxceleb splits:
verification_file: !ref <data_folder>/test_verif.txt

# Here, the pretrained embedding model trained with train_speaker_embeddings.py hparams/train_ecapa_tdnn.yaml
# is downloaded from the speechbrain HuggingFace repository.
# However, a local path pointing to a directory containing your checkpoints may also be specified
# instead (see pretrainer below)
pretrain_path: !ref <save_folder>
#pretrain_path: speechbrain/spkrec-xvect-voxceleb

# csv files
train_data: !ref <data_folder>/train.csv
enrol_data: !ref <data_folder>/enrol.csv
test_data: !ref <data_folder>/test.csv

batch_size: 128
n_train_snts: 300000 # used for normalization stats

# Feature parameters
n_mels: 24
emb_dim: 512

# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>

enrol_dataloader_opts:
    batch_size: !ref <batch_size>

test_dataloader_opts:
    batch_size: !ref <batch_size>

# Model params
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>

mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False

embedding_model: !new:speechbrain.lobes.models.Xvector.Xvector
    in_channels: !ref <n_mels>
    activation: !name:torch.nn.LeakyReLU
    tdnn_blocks: 5
    tdnn_channels: [512, 512, 512, 512, 1500]
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    lin_neurons: !ref <emb_dim>

mean_var_norm_emb: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
    std_norm: False

compute_plda: !new:speechbrain.processing.PLDA_LDA.PLDA
    rank_f: 100
    nb_iter: 10
    scaling_factor: 0.05

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        embedding_model: !ref <embedding_model>
        #mean_var_norm_emb: !ref <mean_var_norm_emb>
    paths:
        embedding_model: !ref <pretrain_path>/embedding_model.ckpt
        #mean_var_norm_emb: !ref <pretrain_path>/normalizer.ckpt

The task resulted in an EER of 47%, which is very bad. To determine what caused this poor performance, I ran the same verification with the pretrained VoxCeleb model (pretrain_path: speechbrain/spkrec-xvect-voxceleb), and the EER was 46%.
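For context on these numbers: an EER close to 50% means the scores barely separate same-speaker trials from different-speaker trials, which is roughly what random scoring would give. The snippet below is a minimal sketch of how an EER can be computed from two lists of scores. It is not the SpeechBrain implementation; the function name `equal_error_rate` and the toy scores are illustrative assumptions.

```python
import numpy as np

def equal_error_rate(positive_scores, negative_scores):
    """Approximate the EER from target (positive) and non-target (negative) scores.

    Illustrative helper, not the recipe's own code.
    """
    pos = np.asarray(positive_scores, dtype=float)
    neg = np.asarray(negative_scores, dtype=float)
    # Candidate thresholds: every distinct score that was observed.
    thresholds = np.unique(np.concatenate([pos, neg]))
    # False rejection rate: target trials scored below the threshold.
    frr = np.array([(pos < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials scored at or above the threshold.
    far = np.array([(neg >= t).mean() for t in thresholds])
    # The EER is read off where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2)

# Well-separated scores give an EER of 0; identical score distributions give 0.5.
separated = equal_error_rate([0.9, 0.8], [0.1, 0.2])
overlapping = equal_error_rate([0.1, 0.9], [0.1, 0.9])
```

Here `separated` is 0.0 and `overlapping` is 0.5, so the 46-47% reported above sits at the chance end of that range.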

As far as I can tell, the problem comes from my test data, but I want to make sure I did everything right on my end, and that all my (configuration) files are not the source of the bad performance.

CSV files I provided

Here is a sneak peek of my CSV files:

  • train.csv
ID,duration,wav,start,stop,spk_id
20000525_1130_1230_rfi_fm_dga-63914-69724-Isabelle_JAMMOT,5810,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,63914,69724,Isabelle_JAMMOT
20000525_1130_1230_rfi_fm_dga-80774-85144-Isabelle_JAMMOT,4370,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,80774,85144,Isabelle_JAMMOT
20000525_1130_1230_rfi_fm_dga-85144-89820-Isabelle_JAMMOT,4676,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,85144,89820,Isabelle_JAMMOT
20000525_1130_1230_rfi_fm_dga-92169-95529-Frederic_DOMONT,3360,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,92169,95529,Frederic_DOMONT
20000525_1130_1230_rfi_fm_dga-99648-103582-Frederic_DOMONT,3934,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,99648,103582,Frederic_DOMONT
  • valid.csv
ID,duration,wav,start,stop,spk_id
20000524_1130_1230_rfi_fm_dga-579528-584110-Isabelle_JAMMOT,4582,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000524_1130_1230_rfi_fm_dga.wav,579528,584110,Isabelle_JAMMOT
20001020_1128_1228_rfi-94235-99443-Isabelle_JAMMOT,5208,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20001020_1128_1228_rfi.wav,94235,99443,Isabelle_JAMMOT
20000524_1130_1230_rfi_fm_dga-156324-160770-Isabelle_JAMMOT,4446,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000524_1130_1230_rfi_fm_dga.wav,156324,160770,Isabelle_JAMMOT
20001005_1128_1228_rfi-122713-127483-Isabelle_JAMMOT,4770,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20001005_1128_1228_rfi.wav,122713,127483,Isabelle_JAMMOT
20000907_1130_1230_rfi_fm_dga-90994-94994-Isabelle_JAMMOT,4000,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000907_1130_1230_rfi_fm_dga.wav,90994,94994,Isabelle_JAMMOT
  • valid_verif.txt
1 19991101_0700_0800_inter-2259370-2264220-Bernadette_CHAMONAZ 20030415_0700_0800_FRANCEINTER_DGA-2354961-2360023-Bernadette_CHAMONAZ
1 20030416_0700_0800_FRANCEINTER_DGA-2157654-2162469-Bernadette_CHAMONAZ 20030414_0700_0800_FRANCEINTER_DGA-2047227-2051027-Bernadette_CHAMONAZ
1 19991102_0700_0800_inter-2130587-2134019-Bernadette_CHAMONAZ 19991029_0700_0800_inter-2505187-2510851-Bernadette_CHAMONAZ
1 19991102_0700_0800_inter-2003082-2006765-Bernadette_CHAMONAZ 20030416_0700_0800_FRANCEINTER_DGA-2231112-2235057-Bernadette_CHAMONAZ
1 19991025_0700_0800_inter-408869-412761-Bernadette_CHAMONAZ 19991102_0700_0800_inter-2349119-2352721-Bernadette_CHAMONAZ
  • enrol.csv
ID,duration,wav,start,stop,spk_id
cavousregardeledebat_2014-06-05_2233-0000-112896-Arnaud_ARDOIN,112896,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-05_2233-0000-112896-Arnaud_ARDOIN.wav,0,112896,Arnaud_ARDOIN
cavousregardeledebat_2013-10-24-0000-16065-Arnaud_ARDOIN,16065,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-10-24-0000-16065-Arnaud_ARDOIN.wav,0,16065,Arnaud_ARDOIN
cavousregardeledebat_2013-10-24-0000-51381-Arnaud_ARDOIN,51381,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-10-24-0000-51381-Arnaud_ARDOIN.wav,0,51381,Arnaud_ARDOIN
cavousregardeledebat_2013-11-18-0000-3136-Arnaud_ARDOIN,3136,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-11-18-0000-3136-Arnaud_ARDOIN.wav,0,3136,Arnaud_ARDOIN
cavousregardeledebat_2013-11-18-0000-13839-Arnaud_ARDOIN,13839,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-11-18-0000-13839-Arnaud_ARDOIN.wav,0,13839,Arnaud_ARDOIN
  • test.csv
ID,duration,wav,start,stop,spk_id
cavousregardeledebat_2014-02-12-0000-120772-Arnaud_ARDOIN,120772,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-02-12-0000-120772-Arnaud_ARDOIN.wav,0,120772,Arnaud_ARDOIN
cavousregardeledebat_2014-06-17_2232-0000-106681-Arnaud_ARDOIN,106681,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-17_2232-0000-106681-Arnaud_ARDOIN.wav,0,106681,Arnaud_ARDOIN
cavousregardeledebat_2014-05-15_2235-0000-80574-Arnaud_ARDOIN,80574,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-05-15_2235-0000-80574-Arnaud_ARDOIN.wav,0,80574,Arnaud_ARDOIN
cavousregardeledebat_2014-06-27_1436-0000-128230-Arnaud_ARDOIN,128230,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-27_1436-0000-128230-Arnaud_ARDOIN.wav,0,128230,Arnaud_ARDOIN
cavousregardeledebat_2014-06-24_2231-0000-106757-Arnaud_ARDOIN,106757,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-24_2231-0000-106757-Arnaud_ARDOIN.wav,0,106757,Arnaud_ARDOIN
  • test_verif.txt
1 cavousregardeledebat_2014-06-05_2233-0000-112896-Arnaud_ARDOIN cavousregardeledebat_2014-02-12-0000-120772-Arnaud_ARDOIN
1 cavousregardeledebat_2013-10-24-0000-16065-Arnaud_ARDOIN cavousregardeledebat_2014-06-17_2232-0000-106681-Arnaud_ARDOIN
1 cavousregardeledebat_2013-10-24-0000-51381-Arnaud_ARDOIN cavousregardeledebat_2014-05-15_2235-0000-80574-Arnaud_ARDOIN
1 cavousregardeledebat_2013-11-18-0000-3136-Arnaud_ARDOIN cavousregardeledebat_2014-06-27_1436-0000-128230-Arnaud_ARDOIN
1 cavousregardeledebat_2013-11-18-0000-13839-Arnaud_ARDOIN cavousregardeledebat_2014-06-24_2231-0000-106757-Arnaud_ARDOIN

I would like to know why I obtained such bad EERs, and to check whether I did everything right on my end. Since there is very little documentation on how to use this recipe with your own data, I may have missed something.

Thank you in advance!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9

Top GitHub Comments

anaisaurus commented, Oct 14, 2021 (3 reactions)

Yes! I realized that I had misunderstood the CSV files. I thought the start and stop columns were timestamps, but they are actually sample indices (a number of samples). This is why I obtained such bad results. I corrected my mistakes, retrained the model successfully, and obtained a final EER of 10%.
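The fix described above can be sketched as follows. This is only an illustration of converting time offsets to sample indices, under two assumptions: the audio is sampled at 16000 Hz (the sample_rate in the training configuration), and the duration column is expressed in seconds. The helper names `seconds_to_samples` and `make_row`, the file name, and the example values are hypothetical.

```python
SAMPLE_RATE = 16000  # assumed to match sample_rate in the training config

def seconds_to_samples(seconds, sample_rate=SAMPLE_RATE):
    """Convert a time offset in seconds to a sample index."""
    return int(round(seconds * sample_rate))

def make_row(utt_id, wav_path, start_s, stop_s, spk_id, sample_rate=SAMPLE_RATE):
    """Build one CSV row with the columns ID,duration,wav,start,stop,spk_id.

    start and stop are written as sample indices; duration is written
    in seconds (an assumption about the expected unit).
    """
    start = seconds_to_samples(start_s, sample_rate)
    stop = seconds_to_samples(stop_s, sample_rate)
    duration = (stop - start) / sample_rate
    return [utt_id, f"{duration:.3f}", wav_path, start, stop, spk_id]

# Hypothetical segment running from 63.914 s to 69.724 s in a recording.
row = make_row("utt1", "audio.wav", 63.914, 69.724, "spk1")
```

With these inputs, `row[3]` is 1022624 and `row[4]` is 1115584. The same segment written as 63914 and 69724 would be read as a window of about 0.36 seconds starting 4 seconds into the file.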

TParcollet commented, Oct 12, 2021 (0 reactions)

@anaisaurus Hi, any news on that?
