EER is very high in speaker verification (using my own model and the default pretrained model)
Hello everyone,
I am currently trying to build a model for speaker verification using my own data. To do so, I duplicated the VoxCeleb recipe and created my own CSV files.
Training the x-vector model
I managed to train the model quite smoothly using python train_speaker_embeddings.py hparams/train_x_vectors.yaml with the following configuration:
# ################################
# Model: Speaker identification with ECAPA
# Authors: Hwidong Na & Mirco Ravanelli
# ################################
# Basic parameters
seed: 1989
__set_seed: !apply:torch.manual_seed [!ref <seed>]
output_folder: !ref results/xvectors/<seed>
save_folder: !ref <output_folder>/save
train_log: log/train.log
# Data files
data_folder: ../../../data/speechbrain
train_annotation: !ref <data_folder>/train.csv
valid_annotation: !ref <data_folder>/valid.csv
# Folder to extract data augmentation files
rir_folder: !ref <data_folder> # Change it if needed
# Use the following links for the official voxceleb splits:
verification_file: !ref <data_folder>/valid_verif.txt
skip_prep: False
ckpt_interval_minutes: 15 # save checkpoint every N min
# Training parameters
number_of_epochs: 20
batch_size: 16
lr: 0.001
lr_final: 0.0001
sample_rate: 16000
sentence_len: 3.0 # seconds
shuffle: True
random_chunk: False
# Feature parameters
n_mels: 24
left_frames: 0
right_frames: 0
deltas: False
# Number of speakers
out_n_neurons: 1621
dataloader_options:
    batch_size: !ref <batch_size>
    shuffle: !ref <shuffle>
    num_workers: 0

# Functions
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>
    left_frames: !ref <left_frames>
    right_frames: !ref <right_frames>
    deltas: !ref <deltas>

embedding_model: !new:speechbrain.lobes.models.Xvector.Xvector
    in_channels: !ref <n_mels>
    activation: !name:torch.nn.LeakyReLU
    tdnn_blocks: 5
    tdnn_channels: [512, 512, 512, 512, 1500]
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    lin_neurons: 512

classifier: !new:speechbrain.lobes.models.Xvector.Classifier
    input_shape: [null, null, 512]
    activation: !name:torch.nn.LeakyReLU
    lin_blocks: 1
    lin_neurons: 512
    out_neurons: !ref <out_n_neurons>

epoch_counter: !new:speechbrain.utils.epoch_loop.EpochCounter
    limit: !ref <number_of_epochs>

augment_wavedrop: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [100]

augment_speed: !new:speechbrain.lobes.augment.TimeDomainSpecAugment
    sample_rate: !ref <sample_rate>
    speeds: [95, 100, 105]

add_rev: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <rir_folder>
    openrir_max_noise_len: 3.0  # seconds
    reverb_prob: 1.0
    noise_prob: 0.0
    noise_snr_low: 0
    noise_snr_high: 15
    rir_scale_factor: 1.0

add_noise: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <rir_folder>
    openrir_max_noise_len: 3.0  # seconds
    reverb_prob: 0.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
    rir_scale_factor: 1.0

add_rev_noise: !new:speechbrain.lobes.augment.EnvCorrupt
    openrir_folder: !ref <rir_folder>
    openrir_max_noise_len: 3.0  # seconds
    reverb_prob: 1.0
    noise_prob: 1.0
    noise_snr_low: 0
    noise_snr_high: 15
    rir_scale_factor: 1.0
# Definition of the augmentation pipeline.
# If concat_augment = False, the augmentation techniques are applied
# in sequence. If concat_augment = True, all the augmented signals
# are concatenated in a single big batch.
augment_pipeline: [
    !ref <augment_wavedrop>,
    !ref <augment_speed>,
    !ref <add_rev>,
    !ref <add_noise>,
    !ref <add_rev_noise>
]
concat_augment: True
mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False

modules:
    compute_features: !ref <compute_features>
    augment_wavedrop: !ref <augment_wavedrop>
    augment_speed: !ref <augment_speed>
    add_rev: !ref <add_rev>
    add_noise: !ref <add_noise>
    add_rev_noise: !ref <add_rev_noise>
    embedding_model: !ref <embedding_model>
    classifier: !ref <classifier>
    mean_var_norm: !ref <mean_var_norm>
# Cost + optimization
compute_cost: !name:speechbrain.nnet.losses.nll_loss
compute_error: !name:speechbrain.nnet.losses.classification_error
opt_class: !name:torch.optim.Adam
    lr: !ref <lr>
    weight_decay: 0.000002

lr_annealing: !new:speechbrain.nnet.schedulers.LinearScheduler
    initial_value: !ref <lr>
    final_value: !ref <lr_final>
    epoch_count: !ref <number_of_epochs>
# Logging + checkpoints
train_logger: !new:speechbrain.utils.train_logger.FileTrainLogger
    save_file: !ref <train_log>

error_stats: !name:speechbrain.utils.metric_stats.MetricStats
    metric: !name:speechbrain.nnet.losses.classification_error
        reduction: batch

checkpointer: !new:speechbrain.utils.checkpoints.Checkpointer
    checkpoints_dir: !ref <save_folder>
    recoverables:
        embedding_model: !ref <embedding_model>
        classifier: !ref <classifier>
        normalizer: !ref <mean_var_norm>
        counter: !ref <epoch_counter>
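As a side note, the concat_augment comment above can be read as follows. This is a schematic sketch of the two behaviours, not SpeechBrain's actual implementation:

```python
import torch

def apply_pipeline(wavs, augmentations, concat_augment=True):
    # Schematic only: mirrors the comment in the YAML above.
    if concat_augment:
        # Every augmentation sees the clean batch, and all outputs are
        # concatenated into one big batch (clean + one copy per technique).
        return torch.cat([wavs] + [aug(wavs) for aug in augmentations], dim=0)
    # Otherwise the augmentations are applied in sequence to the same batch.
    for aug in augmentations:
        wavs = aug(wavs)
    return wavs
```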
Question: The ErrorRate after the last epoch is around 0.8 (in the <output_folder>/log.txt
file). Does it mean 0.8% or 80%? Is it very good or very bad?
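For scale, error metrics of this kind are normally reported as a fraction in [0, 1], so a logged value of 0.8 would read as 80%. A minimal sketch, assuming classification_error behaves like a standard misclassification fraction (the tensors below are made up for illustration):

```python
import torch

# Toy batch: 5 predictions vs 5 targets (made-up values).
predictions = torch.tensor([0, 2, 1, 3, 3])
targets = torch.tensor([0, 1, 2, 3, 0])

# Fraction of misclassified items, in [0, 1].
error_rate = (predictions != targets).float().mean()
print(error_rate.item())  # 0.6 -> 60% of this toy batch is wrong
```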
Speaker verification with PLDA
Then, I performed speaker verification using another set of data, similar to the training and validation sets I used during the x-vector model training. To do so, I used the command python speaker_verification_plda.py hparams/verification_plda_xvector.yaml
with the following configuration:
# ################################
# Model: Speaker Verification Baseline using PLDA
# Authors: Nauman Dawalatabad & Mirco Ravanelli 2020
# ################################
seed: 1989
__set_seed: !apply:torch.manual_seed [!ref <seed>]
# Folders and train_log file
data_folder: ../../../data/speechbrain # use vox 1, vox2, or vox1+vox2 datasets
output_folder: !ref results/xvectors/<seed>
save_folder: !ref <output_folder>/save
device: 'cuda:0'
# Use the following links for the official voxceleb splits:
verification_file: !ref <data_folder>/test_verif.txt
# Here, the pretrained embedding model trained with train_speaker_embeddings.py hparams/train_ecapa_tdnn.yaml
# is downloaded from the speechbrain HuggingFace repository.
# However, a local path pointing to a directory containing your checkpoints may also be specified
# instead (see pretrainer below)
pretrain_path: !ref <save_folder>
#pretrain_path: speechbrain/spkrec-xvect-voxceleb
# csv files
train_data: !ref <data_folder>/train.csv
enrol_data: !ref <data_folder>/enrol.csv
test_data: !ref <data_folder>/test.csv
batch_size: 128
n_train_snts: 300000 # used for normalization stats
# Feature parameters
n_mels: 24
emb_dim: 512
# Dataloader options
train_dataloader_opts:
    batch_size: !ref <batch_size>

enrol_dataloader_opts:
    batch_size: !ref <batch_size>

test_dataloader_opts:
    batch_size: !ref <batch_size>
# Model params
compute_features: !new:speechbrain.lobes.features.Fbank
    n_mels: !ref <n_mels>

mean_var_norm: !new:speechbrain.processing.features.InputNormalization
    norm_type: sentence
    std_norm: False

embedding_model: !new:speechbrain.lobes.models.Xvector.Xvector
    in_channels: !ref <n_mels>
    activation: !name:torch.nn.LeakyReLU
    tdnn_blocks: 5
    tdnn_channels: [512, 512, 512, 512, 1500]
    tdnn_kernel_sizes: [5, 3, 3, 1, 1]
    tdnn_dilations: [1, 2, 3, 1, 1]
    lin_neurons: !ref <emb_dim>

mean_var_norm_emb: !new:speechbrain.processing.features.InputNormalization
    norm_type: global
    std_norm: False

compute_plda: !new:speechbrain.processing.PLDA_LDA.PLDA
    rank_f: 100
    nb_iter: 10
    scaling_factor: 0.05

pretrainer: !new:speechbrain.utils.parameter_transfer.Pretrainer
    collect_in: !ref <save_folder>
    loadables:
        embedding_model: !ref <embedding_model>
        #mean_var_norm_emb: !ref <mean_var_norm_emb>
    paths:
        embedding_model: !ref <pretrain_path>/embedding_model.ckpt
        #mean_var_norm_emb: !ref <pretrain_path>/normalizer.ckpt
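For reference, a minimal sketch of how a pretrainer section like this is typically driven from Python. It mirrors what speaker_verification_plda.py does internally, though the exact calls here are my assumption based on speechbrain.utils.parameter_transfer.Pretrainer:

```python
from hyperpyyaml import load_hyperpyyaml

# Load the hparams file and let the Pretrainer fetch and load the
# embedding-model checkpoint declared under `paths` above.
with open("hparams/verification_plda_xvector.yaml") as fin:
    hparams = load_hyperpyyaml(fin)

hparams["pretrainer"].collect_files()   # copy/download into collect_in
hparams["pretrainer"].load_collected()  # load weights into embedding_model
```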
The task results in an EER of 47%, which is very bad.
To determine what caused this poor performance, I ran the same evaluation with the pretrained VoxCeleb model (pretrain_path: speechbrain/spkrec-xvect-voxceleb), and the EER was 46%.
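For context, the EER is the operating point where the false acceptance rate equals the false rejection rate, so a value near 50% means the scores barely separate genuine from impostor trials. A minimal sketch with made-up scores:

```python
import numpy as np

def compute_eer(positive_scores, negative_scores):
    """Return the error rate at the threshold where FAR ~= FRR."""
    thresholds = np.sort(np.concatenate([positive_scores, negative_scores]))
    eer, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(negative_scores >= t)  # impostor trials accepted
        frr = np.mean(positive_scores < t)   # genuine trials rejected
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Made-up scores: when genuine and impostor distributions fully overlap,
# the EER lands near 0.5 (50%), i.e. the system cannot tell speakers apart.
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 1.0, 1000)
impostor = rng.normal(0.0, 1.0, 1000)
print(compute_eer(genuine, impostor))          # ~0.3: partial separation
print(compute_eer(impostor.copy(), impostor))  # ~0.5: chance level
```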
As far as I can tell, the problem comes from my test data, but I want to make sure I did everything right on my end and that my configuration files are not the source of the bad performance.
CSV files I provided
Here is a sneak peek of my CSV files:
train.csv
ID,duration,wav,start,stop,spk_id
20000525_1130_1230_rfi_fm_dga-63914-69724-Isabelle_JAMMOT,5810,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,63914,69724,Isabelle_JAMMOT
20000525_1130_1230_rfi_fm_dga-80774-85144-Isabelle_JAMMOT,4370,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,80774,85144,Isabelle_JAMMOT
20000525_1130_1230_rfi_fm_dga-85144-89820-Isabelle_JAMMOT,4676,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,85144,89820,Isabelle_JAMMOT
20000525_1130_1230_rfi_fm_dga-92169-95529-Frederic_DOMONT,3360,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,92169,95529,Frederic_DOMONT
20000525_1130_1230_rfi_fm_dga-99648-103582-Frederic_DOMONT,3934,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000525_1130_1230_rfi_fm_dga.wav,99648,103582,Frederic_DOMONT
valid.csv
ID,duration,wav,start,stop,spk_id
20000524_1130_1230_rfi_fm_dga-579528-584110-Isabelle_JAMMOT,4582,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000524_1130_1230_rfi_fm_dga.wav,579528,584110,Isabelle_JAMMOT
20001020_1128_1228_rfi-94235-99443-Isabelle_JAMMOT,5208,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20001020_1128_1228_rfi.wav,94235,99443,Isabelle_JAMMOT
20000524_1130_1230_rfi_fm_dga-156324-160770-Isabelle_JAMMOT,4446,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000524_1130_1230_rfi_fm_dga.wav,156324,160770,Isabelle_JAMMOT
20001005_1128_1228_rfi-122713-127483-Isabelle_JAMMOT,4770,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20001005_1128_1228_rfi.wav,122713,127483,Isabelle_JAMMOT
20000907_1130_1230_rfi_fm_dga-90994-94994-Isabelle_JAMMOT,4000,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/train/20000907_1130_1230_rfi_fm_dga.wav,90994,94994,Isabelle_JAMMOT
valid_verif.txt
1 19991101_0700_0800_inter-2259370-2264220-Bernadette_CHAMONAZ 20030415_0700_0800_FRANCEINTER_DGA-2354961-2360023-Bernadette_CHAMONAZ
1 20030416_0700_0800_FRANCEINTER_DGA-2157654-2162469-Bernadette_CHAMONAZ 20030414_0700_0800_FRANCEINTER_DGA-2047227-2051027-Bernadette_CHAMONAZ
1 19991102_0700_0800_inter-2130587-2134019-Bernadette_CHAMONAZ 19991029_0700_0800_inter-2505187-2510851-Bernadette_CHAMONAZ
1 19991102_0700_0800_inter-2003082-2006765-Bernadette_CHAMONAZ 20030416_0700_0800_FRANCEINTER_DGA-2231112-2235057-Bernadette_CHAMONAZ
1 19991025_0700_0800_inter-408869-412761-Bernadette_CHAMONAZ 19991102_0700_0800_inter-2349119-2352721-Bernadette_CHAMONAZ
enrol.csv
ID,duration,wav,start,stop,spk_id
cavousregardeledebat_2014-06-05_2233-0000-112896-Arnaud_ARDOIN,112896,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-05_2233-0000-112896-Arnaud_ARDOIN.wav,0,112896,Arnaud_ARDOIN
cavousregardeledebat_2013-10-24-0000-16065-Arnaud_ARDOIN,16065,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-10-24-0000-16065-Arnaud_ARDOIN.wav,0,16065,Arnaud_ARDOIN
cavousregardeledebat_2013-10-24-0000-51381-Arnaud_ARDOIN,51381,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-10-24-0000-51381-Arnaud_ARDOIN.wav,0,51381,Arnaud_ARDOIN
cavousregardeledebat_2013-11-18-0000-3136-Arnaud_ARDOIN,3136,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-11-18-0000-3136-Arnaud_ARDOIN.wav,0,3136,Arnaud_ARDOIN
cavousregardeledebat_2013-11-18-0000-13839-Arnaud_ARDOIN,13839,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2013-11-18-0000-13839-Arnaud_ARDOIN.wav,0,13839,Arnaud_ARDOIN
test.csv
ID,duration,wav,start,stop,spk_id
cavousregardeledebat_2014-02-12-0000-120772-Arnaud_ARDOIN,120772,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-02-12-0000-120772-Arnaud_ARDOIN.wav,0,120772,Arnaud_ARDOIN
cavousregardeledebat_2014-06-17_2232-0000-106681-Arnaud_ARDOIN,106681,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-17_2232-0000-106681-Arnaud_ARDOIN.wav,0,106681,Arnaud_ARDOIN
cavousregardeledebat_2014-05-15_2235-0000-80574-Arnaud_ARDOIN,80574,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-05-15_2235-0000-80574-Arnaud_ARDOIN.wav,0,80574,Arnaud_ARDOIN
cavousregardeledebat_2014-06-27_1436-0000-128230-Arnaud_ARDOIN,128230,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-27_1436-0000-128230-Arnaud_ARDOIN.wav,0,128230,Arnaud_ARDOIN
cavousregardeledebat_2014-06-24_2231-0000-106757-Arnaud_ARDOIN,106757,/local_disk/orion/anaisaurus/fabiole2/data/speechbrain/wav/valid/cavousregardeledebat_2014-06-24_2231-0000-106757-Arnaud_ARDOIN.wav,0,106757,Arnaud_ARDOIN
test_verif.txt
1 cavousregardeledebat_2014-06-05_2233-0000-112896-Arnaud_ARDOIN cavousregardeledebat_2014-02-12-0000-120772-Arnaud_ARDOIN
1 cavousregardeledebat_2013-10-24-0000-16065-Arnaud_ARDOIN cavousregardeledebat_2014-06-17_2232-0000-106681-Arnaud_ARDOIN
1 cavousregardeledebat_2013-10-24-0000-51381-Arnaud_ARDOIN cavousregardeledebat_2014-05-15_2235-0000-80574-Arnaud_ARDOIN
1 cavousregardeledebat_2013-11-18-0000-3136-Arnaud_ARDOIN cavousregardeledebat_2014-06-27_1436-0000-128230-Arnaud_ARDOIN
1 cavousregardeledebat_2013-11-18-0000-13839-Arnaud_ARDOIN cavousregardeledebat_2014-06-24_2231-0000-106757-Arnaud_ARDOIN
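One thing worth noting: the excerpts above only show label-1 (same-speaker) trials, and the EER is only meaningful when the trial file also contains label-0 (different-speaker) trials. Here is a minimal sketch of generating such a file from one of the CSVs above; the helper name and pairing logic are hypothetical:

```python
import csv
import random

def make_trials(csv_path, out_path, n_pairs=1000, seed=1989):
    """Hypothetical helper: write '1 uttA uttB' target and '0 uttA uttB'
    non-target trials from a csv with ID and spk_id columns."""
    rng = random.Random(seed)
    by_spk = {}
    with open(csv_path) as fin:
        for row in csv.DictReader(fin):
            by_spk.setdefault(row["spk_id"], []).append(row["ID"])
    # Assumes at least two speakers, and at least one speaker with >= 2 utts.
    speakers = list(by_spk)
    multi = [s for s in speakers if len(by_spk[s]) >= 2]
    with open(out_path, "w") as fout:
        for _ in range(n_pairs):
            a, b = rng.sample(by_spk[rng.choice(multi)], 2)
            fout.write(f"1 {a} {b}\n")
            s1, s2 = rng.sample(speakers, 2)
            fout.write(f"0 {rng.choice(by_spk[s1])} {rng.choice(by_spk[s2])}\n")

make_trials("test.csv", "test_verif.txt")
```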
What I would like to know is why I obtained such bad EERs, and I would also like to check whether I did everything right on my end. Since there is very little documentation on how to use this recipe with your own data, I may have missed something.
Thank you in advance!
Top GitHub Comments
Yes! I realized that I misunderstood the csv files. I thought the start and stop columns were timestamps, but they are actually sample counts. This is why I obtained these bad results. I corrected my mistake, successfully retrained the model, and obtained a final EER of 10%.
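In other words, segment boundaries annotated in seconds have to be converted to sample indices before they go into the CSV. A minimal sketch, assuming 16 kHz audio to match sample_rate in the training YAML (the timestamps below are made up):

```python
SAMPLE_RATE = 16000  # must match sample_rate in the training YAML

# Hypothetical segment annotated in seconds.
start_sec, stop_sec = 3.99, 4.36

# start/stop in the csv are sample indices, not seconds.
start = round(start_sec * SAMPLE_RATE)
stop = round(stop_sec * SAMPLE_RATE)

# In the csv excerpts above, duration mirrors stop - start (same unit);
# double-check which unit your version of the recipe expects here.
duration = stop - start
print(start, stop, duration)  # 63840 69760 5920
```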
@anaisaurus Hi, any news on that?