Recreating TIMIT results for phoneme recognition

❓ Questions and Help

Before asking:

search the issues.
Referenced this issue #2922
search the docs.

What is your question?

I am trying to replicate the wav2vec2.0 research paper’s results for phoneme recognition on TIMIT.

As an intermediate step, I fine-tuned the Wav2Vec 2.0 Large model (the “No finetuning” checkpoint) on phoneme labels for librispeech’s dev-clean dataset for 1760 epochs. When I evaluated the best checkpoint on a subset of dev-clean (the same data as the training set), the WER (actually PER, since the labels are phonemes) was 100 because every prediction is an empty tensor.

My question is: is there something wrong with my setup that is causing these empty predictions? Do I just need to train for more epochs?

I only trained the model for ~14 hours on a single V100, so I don’t expect the model to be stunning, but since I’m evaluating the model on a subset of the training set, I assume it should show some improvement. And as a reference, when I evaluate a checkpoint trained for only a single epoch, the one-epoch checkpoint provides non-empty predictions with a WER of 97.0932. So, the model initially provides non-empty predictions.

Code

N/A

What have you tried?

I’ll try to be pretty detailed about my setup, as a resource for anyone else trying to perform phoneme recognition with wav2vec2.0.

I created the train.tsv file using the steps @alexeib provides in issue #2922 using the wav2vec_manifest.py file.

Here are the files needed for training and evaluation:

.
├── dev_other.phn
├── dev_other.tsv
├── dict.phn.txt
├── train.phn
├── train.tsv
├── valid.phn
└── valid.tsv

I found that the valid.tsv and valid.phn files were not necessary for training, but the training code complains if the dev_other.tsv and dev_other.phn files are missing.

Here is a sample of my `train.tsv` file from the `wav2vec_manifest.py` script

/mnt/disks/data_disk/data/librispeech/dev-clean
251/137823/251-137823-0007.wav  56640
251/137823/251-137823-0021.wav  109120
251/137823/251-137823-0023.wav  37360

And a sample of the `train.phn` file which I created with the phoneme labels matching line-by-line the paths in `train.tsv`

f ao r m ih n ah t s n ow hh w ah n s t er d ah m ah ng dh ah r eh k ah jh
m ih s t er s w ih f t k ey m ih n t ah dh ah l ih v ih ng r uw m jh ah s t dh eh n ae n d t ow l d t aa m hh aw w er iy d m ih s ih z s w ih f t ae n d s ae n d iy hh ae d b ah n
hh iy z ah g r ey t s ay ah n t ih s t
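
Because the labels have to match the manifest line by line, a quick consistency check before training can save a wasted run. The following is a minimal sketch, not part of fairseq: `check_alignment` is a hypothetical helper, and it assumes the manifest is tab-separated with the audio root directory on its first line, as in the sample above.

```python
from pathlib import Path


def check_alignment(tsv_path, phn_path):
    """Return (root, n_items) if manifest and label file line up, else raise ValueError."""
    tsv_lines = Path(tsv_path).read_text().splitlines()
    phn_lines = Path(phn_path).read_text().splitlines()
    # Assumption: the first manifest line is the audio root directory.
    root, entries = tsv_lines[0], tsv_lines[1:]
    if len(entries) != len(phn_lines):
        raise ValueError(
            f"{len(entries)} audio entries but {len(phn_lines)} label lines"
        )
    for i, (entry, labels) in enumerate(zip(entries, phn_lines), start=1):
        # Assumption: each entry is "<relative path>\t<number of samples>".
        rel_path, n_samples = entry.split("\t")
        if int(n_samples) <= 0 or not labels.split():
            raise ValueError(f"entry {i} ({rel_path}) has no samples or no labels")
    return root, len(entries)
```

Calling `check_alignment("train.tsv", "train.phn")` from the data directory returns the root path and the number of utterances, or raises if the two files have drifted apart.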

And the full `dict.phn.txt` file

aa 1
ae 1
ah 1
ao 1
aw 1
ay 1
b 1
ch 1
d 1
dh 1
eh 1
er 1
ey 1
f 1
g 1
hh 1
ih 1
iy 1
jh 1
k 1
l 1
m 1
n 1
ng 1
ow 1
oy 1
p 1
r 1
s 1
sh 1
t 1
th 1
uh 1
uw 1
v 1
w 1
y 1
z 1
zh 1
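
A dictionary file in this "symbol count" layout can be derived from the label file instead of being written by hand. This is only a sketch: `build_dict_lines` is a hypothetical helper, and it writes real occurrence counts where the listing above uses a placeholder count of 1. Whether the count value affects fine-tuning is not something this thread establishes.

```python
from collections import Counter


def build_dict_lines(label_lines):
    """Return 'symbol count' lines, one per distinct phoneme, sorted by symbol."""
    counts = Counter(tok for line in label_lines for tok in line.split())
    return [f"{sym} {n}" for sym, n in sorted(counts.items())]


if __name__ == "__main__":
    # Assumes train.phn is in the current directory, as in the file tree above.
    with open("train.phn") as src, open("dict.phn.txt", "w") as dst:
        dst.write("\n".join(build_dict_lines(src)) + "\n")
```

Sorting alphabetically reproduces the symbol order of the listing above, which matters if the token indices need to stay stable between runs.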

With those files, I was able to finetune the model using this train command:

fairseq-hydra-train     \
	distributed_training.distributed_port=1234     \
	task.data=/mnt/disks/data_disk/data/librispeech/fair-test/phn \
	task.labels="phn"     \
	model.w2v_path=/home/dzubke/fairseq/examples/wav2vec/models/libri960_big.pt \
	distributed_training.distributed_world_size=1 \
	+optimization.update_freq='[64]' \
	checkpoint.no_epoch_checkpoints=False \
	checkpoint.keep_best_checkpoints=1 \
	checkpoint.keep_last_epochs=1  \
	checkpoint.save_interval=10 \
	common.fp16_init_scale=1 \
	--config-dir /home/dzubke/fairseq/examples/wav2vec/config/finetuning    \
	--config-name base_1h

I initially had a hard time with the checkpoints during demo runs because the default checkpoint.save_interval is 1000, so I reduced that. Note the task.data path points to the directory where the train.tsv and other files are located. I’m using a single V100 which determines the distributed_training.distributed_world_size and +optimization.update_freq values.
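
The relationship between those two values is simple arithmetic: with gradient accumulation, each optimizer step sees `distributed_world_size * update_freq` batches. The sketch below only illustrates that product; both function names are made up for this example, and the target of 24 in the usage lines is an assumed figure for illustration, not a value taken from this thread.

```python
def effective_gpus(world_size, update_freq):
    """Number of per-GPU batches accumulated into one optimizer step."""
    return world_size * update_freq


def update_freq_for(target_gpus, world_size):
    """update_freq needed to simulate target_gpus on world_size real GPUs."""
    if target_gpus % world_size:
        raise ValueError("target_gpus must be a multiple of world_size")
    return target_gpus // world_size


# The command above: 1 GPU with update_freq=[64].
print(effective_gpus(1, 64))    # 64
# Hypothetical: simulating 24 GPUs on 8 real ones.
print(update_freq_for(24, 8))   # 3
```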

The excerpt below shows the beginning and end of my `hydra_train.log` file. The `train_loss` value decreases by roughly a factor of 4 (from 1397.73 at epoch 3 to 335.217 at epoch 1761), which suggests the model is improving on the loss function.

[2021-03-29 17:13:51,370][fairseq.trainer][INFO] - begin training epoch 3
[2021-03-29 17:13:51,370][fairseq_cli.train][INFO] - Start iterating over samples
[2021-03-29 17:14:19,216][fairseq_cli.train][INFO] - end of epoch 3 (average epoch stats below)
[2021-03-29 17:14:19,217][train][INFO] - {"epoch": 3, "train_loss": "1397.73", "train_ntokens": "89218", "train_nsentences": "1307", "train_nll_loss": "20.476", "train_wps": "6380.8", "train_ups": "0.07", "train_wpb": "89218", "train_bsz": "1307", "train_num_updates": "3", "train_lr": "6.14231e-07", "train_gnorm": "647.555", "train_loss_scale": "0.125", "train_train_wall": "26", "train_gb_free": "7.5", "train_wall": "90"}
...
[2021-03-30 07:38:32,216][fairseq.trainer][INFO] - begin training epoch 1761
[2021-03-30 07:38:32,217][fairseq_cli.train][INFO] - Start iterating over samples
[2021-03-30 07:39:11,820][fairseq_cli.train][INFO] - end of epoch 1761 (average epoch stats below)
[2021-03-30 07:39:11,820][train][INFO] - {"epoch": 1761, "train_loss": "335.217", "train_ntokens": "89218", "train_nsentences": "1307", "train_nll_loss": "4.911", "train_wps": "4429.1", "train_ups": "0.05", "train_wpb": "89218", "train_bsz": "1307", "train_num_updates": "3510", "train_lr": "5e-05", "train_gnorm": "13.555", "train_loss_scale": "1", "train_train_wall": "26", "train_gb_free": "7.5", "train_wall": "51982"}

I don’t fully understand how the model could decrease the training loss while also learning to predict empty tensors.

Here is my command for running evaluation:

python examples/speech_recognition/infer.py \
    /mnt/disks/data_disk/data/librispeech/fair-test/phn \
    --task audio_pretraining \
    --nbest 1 \
    --path /home/dzubke/fairseq/examples/wav2vec/outputs/2021-03-29/17-38-34/checkpoints/checkpoint_best.pt \
    --gen-subset valid \
    --results-path /home/dzubke/fairseq/examples/wav2vec/outputs/2021-03-29/17-38-34/results \
    --w2l-decoder viterbi \
    --criterion ctc \
    --labels phn \
    --max-tokens 4000000

Here are some excerpts from the evaluation output for the one-epoch checkpoint. The `~~ drz` lines come from print statements I added to show the token tensors in the hypos dict.

~~ drz: hypos: [{'tokens': tensor([15]), 'score': 0}]
INFO:__main__:HYPO:er
INFO:__main__:TARGET:dh ah s ah v ih l y ah n s p eh sh ah l ah s t s ih n ah dh er f iy l d z ae n d dh ah s p ey s f ao r s p iy p ah l hh uw hh ae d b ah n hh ow l d ih ng t ey p l ay n z ae n d m ey k ih ng s k eh ch ah z ae n d s n ae p ih ng k ae m er ah z w er ao l f l ay ih ng t ah l ow er s ih r t ih s t ah f ay n d aw t hh aw m ah ch aa k s ah jh ah n dh eh r w aa z ae n d hh w ah t k ay n d ah v l ay f ih t s ah p ao r t ah d
INFO:__main__:___________________
~~ drz: hypos: [{'tokens': tensor([15, 11, 15, 17, 15, 24, 15, 36, 15, 36, 15, 36, 15]), 'score': 0}]
INFO:__main__:HYPO:er ch er f er l er uh er uh er uh er
INFO:__main__:TARGET:t ow n iy l ae t ah m er dh ah d ih s k ah v er er w aa z b ih g ih n ih ng t ah k ae sh ih n aa n hh ih z ah t eh n sh ah n z t ah g l ao r iy ah ae n d hh ih z ih ng g r ey sh iy ey sh ah n w ih dh s ih d hh iy w aa z ao l w ey z ay dh er m ey k ih ng v oy s ae n d ih m ah jh t ao k s f ao r t eh l ah k ae s t ao r l ih s ah n ih ng t ah dh ah n uw z f er m dh ah hh ow m p l ae n ah t
INFO:__main__:___________________
...
INFO:__main__:WER: 97.09322935129387
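
The token ids in those tensors can be mapped back to phonemes by hand, which is a useful sanity check on the dictionary. The sketch below assumes the fairseq dictionary reserves indices 0 to 3 for the special symbols `<s>`, `<pad>`, `</s>` and `<unk>`, with the `dict.phn.txt` entries following in file order. Under that assumption the vocabulary has 39 + 4 = 43 entries. `ids_to_phonemes` is a hypothetical helper, not a fairseq function.

```python
# Assumed special symbols occupying indices 0-3.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]

# The 39 phonemes in the order they appear in dict.phn.txt.
PHONEMES = (
    "aa ae ah ao aw ay b ch d dh eh er ey f g hh ih iy jh k l m n ng "
    "ow oy p r s sh t th uh uw v w y z zh"
).split()

VOCAB = SPECIALS + PHONEMES


def ids_to_phonemes(ids):
    """Map a sequence of token ids to a space-separated phoneme string."""
    return " ".join(VOCAB[i] for i in ids)


print(ids_to_phonemes([15]))
# er
print(ids_to_phonemes([15, 11, 15, 17, 15, 24, 15, 36, 15, 36, 15, 36, 15]))
# er ch er f er l er uh er uh er uh er
```

Both outputs agree with the HYPO lines logged above, so the offset of four special symbols is at least consistent with this run.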

And here are the excerpts from the evaluation for the 1760-epoch checkpoint. Note that hypos[0]['tokens'] is empty.

~~ drz: hypos: [{'tokens': tensor([]), 'score': 0}]
INFO:__main__:HYPO:
INFO:__main__:TARGET:dh ah s ah v ih l y ah n s p eh sh ah l ah s t s ih n ah dh er f iy l d z ae n d dh ah s p ey s f ao r s p iy p ah l hh uw hh ae d b ah n hh ow l d ih ng t ey p l ay n z ae n d m ey k ih ng s k eh ch ah z ae n d s n ae p ih ng k ae m er ah z w er ao l f l ay ih ng t ah l ow er s ih r t ih s t ah f ay n d aw t hh aw m ah ch aa k s ah jh ah n dh eh r w aa z ae n d hh w ah t k ay n d ah v l ay f ih t s ah p ao r t ah d
INFO:__main__:___________________
INFO:__main__:HYPO:
INFO:__main__:TARGET:t ow n iy l ae t ah m er dh ah d ih s k ah v er er w aa z b ih g ih n ih ng t ah k ae sh ih n aa n hh ih z ah t eh n sh ah n z t ah g l ao r iy ah ae n d hh ih z ih ng g r ey sh iy ey sh ah n w ih dh s ih d hh iy w aa z ao l w ey z ay dh er m ey k ih ng v oy s ae n d ih m ah jh t ao k s f ao r t eh l ah k ae s t ao r l ih s ah n ih ng t ah dh ah n uw z f er m dh ah hh ow m p l ae n ah t
INFO:__main__:___________________
....
INFO:__main__:WER: 100.0

I looked inside the `encoder_input` and `emissions` tensors. The `encoder_input` values are small but not uniformly zero. It's hard for me to judge what reasonable values for `encoder_input` and `emissions` look like.

~~~ drz: encoder_input:
~~~ drz: source,mask shape: torch.Size([8, 261760]), torch.Size([8, 261760])
~~~ drz: source 1 sum: -0.387939453125
~~~ drz: contents: {'source': tensor([[ 3.0518e-05,  3.0518e-05,  9.1553e-05,  ...,  0.0000e+00,
         -1.2207e-04, -3.0518e-05],
        [-6.7139e-04, -8.5449e-04, -7.6294e-04,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        ...,
        [-6.4087e-04, -7.3242e-04, -8.5449e-04,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00],
        [-2.1362e-04, -2.1362e-04, -2.7466e-04,  ...,  0.0000e+00,
          0.0000e+00,  0.0000e+00]], device='cuda:0'), 'padding_mask': tensor([[False, False, False,  ..., False, False, False],
        [False, False, False,  ...,  True,  True,  True],
        [False, False, False,  ...,  True,  True,  True],
        ...,
        [False, False, False,  ...,  True,  True,  True]], device='cuda:0')}

~~~ drz: model has get_logits: True
~~~ drz: emissions:
~~~ drz: emissions shape: torch.Size([24, 439, 43])
~~~ drz: contents: tensor([[[ 10.6958,  -7.8463,  -8.0892,  ...,  -7.1404,  -8.0253,  -7.1810],
         [ 10.6323,  -7.7410,  -7.9552,  ...,  -7.0427,  -7.9897,  -7.0889],
         [ 10.6305,  -7.7033,  -7.9448,  ...,  -6.9910,  -7.9354,  -7.0399],
         ...,
         [ 10.9660,  -7.8253,  -8.2720,  ...,  -7.1628,  -7.9288,  -7.2011],
         [ 10.9660,  -7.8253,  -8.2720,  ...,  -7.1628,  -7.9288,  -7.2011],
         [ 10.9660,  -7.8253,  -8.2720,  ...,  -7.1628,  -7.9288,  -7.2011]]])
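
One possible reading of this dump: in every row shown, column 0 is around 10.7 while the visible remaining columns sit near -7 to -8. If column 0 is the CTC blank (an assumption here; the thread does not confirm which index the decoder treats as blank), then picking the best symbol per frame and removing blanks leaves nothing, which would be consistent with the empty hypotheses. The sketch below is a plain greedy CTC decode, not the flashlight Viterbi decoder used by `infer.py`, and the three-column frames are a truncated toy built from the first values printed above.

```python
def greedy_ctc_decode(emissions, blank=0):
    """Greedy CTC decode: argmax per frame, merge repeats, drop blanks.

    emissions: list of frames, each a list of per-symbol scores.
    blank: index treated as the CTC blank (assumed to be 0 here).
    """
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emissions]
    decoded, prev = [], None
    for idx in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Toy frames: only the first three columns of the rows printed above.
blank_dominated = [
    [10.6958, -7.8463, -8.0892],
    [10.6323, -7.7410, -7.9552],
    [10.6305, -7.7033, -7.9448],
]
print(greedy_ctc_decode(blank_dominated))  # []
```

A model that scores blank highest on every frame still receives a finite CTC loss, so a falling loss and empty decodes are not contradictory on their own.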

For other people trying to do this: I found I also needed to install the flashlight python bindings as described in the README to run evaluation. I thought I wouldn’t need them, as the README only mentions the bindings when using a language model (which I’m not using), but the viterbi decoder requires them too.

To install flashlight, the KenLM, OpenBLAS, and FFTW dependencies weren’t too hard to install. Installing Intel’s MKL was pretty annoying, though. It wasn’t listed as a necessary dependency for flashlight, but I couldn’t get flashlight to build without it.

I’m going to try another training run using 8 V100’s to see if training longer will resolve the issue with the empty predictions.

Thanks for the help!

What’s your environment?

fairseq Version: 0.10.2
PyTorch Version: 1.7.1
OS (e.g. Linux): Ubuntu 18.04
How you installed fairseq (pip, source): pip install --editable ./
Build command you used (if compiling from source):
Python version: 3.8.2
CUDA/cuDNN version: 11.2
GPU models and configuration: Nvidia Tesla V100
Any other relevant information: Built flashlight python bindings from source


Top GitHub Comments

dzubke commented, Apr 5, 2021

I was able to approximately recreate the phoneme recognition results on the timit dataset. My training and validation sets differ from those used in the research paper because I used the train/validation split specified here.

My finetuned model achieved an 8.6 PER on my timit validation set, which is close to the 7.4 dev-PER and 8.3 test-PER values in the paper. Given the differences in the dev sets, this counts as a re-creation of the phoneme recognition results for my purposes.

This is the evaluation command I ran:

python examples/speech_recognition/infer.py     \
    /mnt/disks/data_disk/data/timit/fair/    \
    --task audio_pretraining     \
    --nbest 1     \
    --path /home/dzubke/fairseq/outputs/2021-04-03/14-29-44/checkpoints/checkpoint_best.pt     \
    --gen-subset valid     \
    --results-path /home/dzubke/fairseq/outputs/2021-04-03/14-29-44/results     \
    --w2l-decoder viterbi     \
    --criterion ctc     \
    --labels phn \
    --max-tokens 4000000

And here is the end of the evaluation output

INFO:__main__:HYPO:sil b r ih ng m iy dh ah f aa r sil k r ae sil k er s sil
INFO:__main__:TARGET:sil b r ih ng m iy dh ih f ay er sil k r ae sil k er z sil
INFO:__main__:___________________
INFO:__main__:HYPO:sil eh n w ah dx ay z dh ey w er sil
INFO:__main__:TARGET:sil eh n w ah dx ay z dh ey w er sil
INFO:__main__:___________________
INFO:__main__:HYPO:sil hh w ay sh uh sil dh ih s sil b iy s ow sil
INFO:__main__:TARGET:sil hh w ay sh ih sil dh ih s sil b iy s ow sil
INFO:__main__:___________________
INFO:__main__:HYPO:sil w ey dx ah l ih dx l w ay l sil
INFO:__main__:TARGET:sil w ey dx ah l ih dx l w ay l sil
INFO:__main__:___________________
INFO:__main__:WER: 8.608                                                                                                         
INFO:__main__:| Processed 500 sentences (18987 tokens) in 10.0s (50.25sentences/s, 1908.10 tokens/s)
INFO:__main__:| Generate valid with beam=5
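
The figure labelled WER here is a phoneme error rate because each token is a phoneme. A minimal sketch of how such a number can be computed follows: total Levenshtein edit distance divided by total reference length, times 100. It is an illustration under that assumption, not the code `infer.py` runs, and both function names are made up for this example.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # drop the hypothesis token
                cur[j - 1] + 1,          # insert the reference token
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def phoneme_error_rate(pairs):
    """PER in percent over (hypothesis, target) string pairs."""
    errors = sum(edit_distance(h.split(), r.split()) for h, r in pairs)
    ref_len = sum(len(r.split()) for _, r in pairs)
    return 100.0 * errors / ref_len


# First HYPO/TARGET pair from the output above: four substitutions.
hyp = "sil b r ih ng m iy dh ah f aa r sil k r ae sil k er s sil"
ref = "sil b r ih ng m iy dh ih f ay er sil k r ae sil k er z sil"
print(edit_distance(hyp.split(), ref.split()))  # 4
# An empty hypothesis costs one deletion per target token.
print(phoneme_error_rate([("", ref)]))          # 100.0
```

The second call shows why all-empty predictions give exactly 100: every reference token counts as one error.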

You may notice the phonemes in my evaluation are different from those in the research paper. I used a different but equivalent collapsing of the original 60 timit phonemes outlined here.

Below is the training command I ran. I had to reduce max-tokens (similar to batch size) a few times to avoid a GPU out-of-memory error. Because of the OOM errors, I started the subsequent training runs from a checkpoint using the checkpoint.restore_file value. I’m not sure whether specifying both checkpoint.restore_file and model.w2v_path is redundant, but using both seemed to work.

fairseq-hydra-train     \
    task.data=/mnt/disks/data_disk/data/timit/fair/ \
    model.w2v_path=/home/dzubke/fairseq/examples/wav2vec/models/libri960_big.pt  \
    checkpoint.restore_file=/home/dzubke/fairseq/outputs/2021-04-02/12-08-06/checkpoints/checkpoint_last.pt \
    dataset.max_tokens=1000000  \
    distributed_training.distributed_world_size=8 \
    +optimization.update_freq='[3]' \
    --config-dir /home/dzubke/fairseq/examples/wav2vec/config/finetuning  \
    --config-name base_10h_2021-04-01.yaml

Here are the final outputs from my hydra_train.log file

[2021-04-03 19:14:55,481][fairseq.trainer][INFO] - begin training epoch 3274
[2021-04-03 19:14:55,481][fairseq_cli.train][INFO] - Start iterating over samples
[2021-04-03 19:15:09,136][train_inner][INFO] - {"epoch": 3274, "update": 3273.308, "loss": "1.034", "ntokens": "13235.3", "nsentences": "352.5", "nll_loss": "0.028", "wps": "7770.9", "ups": "0.59", "wpb": "13235.3", "bsz": "352.5", "num_updates": "20000", "lr": "2.5e-06", "gnorm": "16.069", "loss_scale": "16", "train_wall": "149", "gb_free": "6.4", "wall": "0"}
[2021-04-03 19:15:09,136][fairseq_cli.train][INFO] - Stopping training due to num_updates: 20000 >= max_update: 20000
[2021-04-03 19:15:09,137][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2021-04-03 19:15:20,483][valid][INFO] - {"epoch": 3274, "valid_loss": "45.152", "valid_ntokens": "3750", "valid_nsentences": "100", "valid_nll_loss": "1.204", "valid_uer": "8.576", "valid_wer": "91.6", "valid_raw_wer": "91.6", "valid_wps": "6373.6", "valid_wpb": "3750", "valid_bsz": "100", "valid_num_updates": "20000", "valid_best_wer": "89.8"}
~~~ drz: making checkpoint dir ~~~~
[2021-04-03 19:15:20,485][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 3274 @ 20000 updates
[2021-04-03 19:15:20,486][fairseq.trainer][INFO] - Saving checkpoint to checkpoints/checkpoint_3274_20000.pt
[2021-04-03 19:15:28,497][fairseq.trainer][INFO] - Finished saving checkpoint to checkpoints/checkpoint_3274_20000.pt
[2021-04-03 19:15:40,214][fairseq.checkpoint_utils][INFO] - Saved checkpoint checkpoints/checkpoint_3274_20000.pt (epoch 3274 @ 20000 updates, score 91.6) (writing took 19.728936845000135 seconds)
[2021-04-03 19:15:40,215][fairseq_cli.train][INFO] - end of epoch 3274 (average epoch stats below)
[2021-04-03 19:15:40,217][train][INFO] - {"epoch": 3274, "train_loss": "1.01", "train_ntokens": "13379.8", "train_nsentences": "350", "train_nll_loss": "0.026", "train_wps": "1156.4", "train_ups": "0.09", "train_wpb": "13379.8", "train_bsz": "350", "train_num_updates": "20000", "train_lr": "2.5e-06", "train_gnorm": "14.514", "train_loss_scale": "16", "train_train_wall": "4", "train_gb_free": "6.4", "train_wall": "0"}

As Alexei had mentioned above, my previous train_nll_loss was too high at around 4. At the end of this training run, it dropped to 0.026.

The training job ran for ~3300 epochs with 20,000 updates. These training values were unchanged from the original base_10h.yaml file. I only changed the dataset.valid_subset and common.tensorboard_logdir values in the modified base_10h_2021-04-01.yaml file. Using 8 V100s, this took roughly 24 hours to train.

Thanks, @alexeib, for the help!

dzubke commented, Apr 5, 2021

As a point of clarification, I began this thread training on the 39 cmu-dict phonemes in the librispeech dataset, but I ended the thread training on a different set of 39 phonemes on the timit dataset.
