Recreating TIMIT results for phoneme recognition
❓ Questions and Help
Before asking:
- search the issues. Referenced this issue #2922
- search the docs.
What is your question?
I am trying to replicate the wav2vec2.0 research paper’s results for phoneme recognition on TIMIT.
As an intermediate step, I have fine-tuned a Wav2Vec 2.0 Large (No finetuning) model using phonemes from librispeech's dev-clean dataset, trained for 1760 epochs. When evaluating the best checkpoint after 1760 epochs on a subset of the dev-clean dataset (the same data as the training set), I found that the WER (actually PER, since the tokens are phonemes) is 100 because every prediction is an empty tensor.
My question is: is there something wrong with my setup that is causing these empty predictions? Do I just need to train for more epochs?
I only trained the model for ~14 hours on a single V100, so I don’t expect the model to be stunning, but since I’m evaluating the model on a subset of the training set, I assume it should show some improvement. And as a reference, when I evaluate a checkpoint trained for only a single epoch, the one-epoch checkpoint provides non-empty predictions with a WER of 97.0932. So, the model initially provides non-empty predictions.
Code
N/A
What have you tried?
I'll try to be pretty detailed with my setup, as a resource for anyone else trying to perform phoneme recognition with wav2vec 2.0.
I created the `train.tsv` file with the `wav2vec_manifest.py` script, following the steps @alexeib provides in issue #2922.
Here are the files needed for training and evaluation:
.
├── dev_other.phn
├── dev_other.tsv
├── dict.phn.txt
├── train.phn
├── train.tsv
├── valid.phn
└── valid.tsv
I found the `valid.tsv` and `valid.phn` files were not necessary for training, but the training code will complain if the `dev_other.tsv` and `dev_other.phn` files are missing.
Here is a sample of my `train.tsv` file from the `wav2vec_manifest.py` script
/mnt/disks/data_disk/data/librispeech/dev-clean
251/137823/251-137823-0007.wav 56640
251/137823/251-137823-0021.wav 109120
251/137823/251-137823-0023.wav 37360
And a sample of the `train.phn` file which I created with the phoneme labels matching line-by-line the paths in `train.tsv`
f ao r m ih n ah t s n ow hh w ah n s t er d ah m ah ng dh ah r eh k ah jh
m ih s t er s w ih f t k ey m ih n t ah dh ah l ih v ih ng r uw m jh ah s t dh eh n ae n d t ow l d t aa m hh aw w er iy d m ih s ih z s w ih f t ae n d s ae n d iy hh ae d b ah n
hh iy z ah g r ey t s ay ah n t ih s t
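Because the labels have to match the manifest line by line, a quick sanity check can catch off-by-one mistakes before a long training run. The following is a minimal sketch and not part of fairseq: `check_alignment` is a hypothetical helper, and it assumes the first line of the `.tsv` is the audio root directory (as in the sample above) and that every later line is a tab-separated relative path and sample count.

```python
from pathlib import Path


def check_alignment(tsv_path, phn_path):
    """Return the number of utterances if manifest and labels line up.

    Assumption: line 1 of the .tsv is the audio root directory and each
    later line is "<relative path><TAB><num samples>".
    """
    tsv_lines = Path(tsv_path).read_text().splitlines()
    phn_lines = Path(phn_path).read_text().splitlines()

    entries = tsv_lines[1:]  # skip the root-directory header line
    if len(entries) != len(phn_lines):
        raise ValueError(
            f"{len(entries)} audio entries but {len(phn_lines)} label lines"
        )
    for i, (entry, labels) in enumerate(zip(entries, phn_lines), start=1):
        if len(entry.split("\t")) != 2:
            raise ValueError(f"manifest entry {i} is not tab-separated")
        if not labels.split():
            raise ValueError(f"label line {i} is empty")
    return len(entries)
```

Calling `check_alignment("train.tsv", "train.phn")` either returns the utterance count or raises a `ValueError` naming the first problem it finds.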
And the full `dict.phn.txt` file
aa 1
ae 1
ah 1
ao 1
aw 1
ay 1
b 1
ch 1
d 1
dh 1
eh 1
er 1
ey 1
f 1
g 1
hh 1
ih 1
iy 1
jh 1
k 1
l 1
m 1
n 1
ng 1
ow 1
oy 1
p 1
r 1
s 1
sh 1
t 1
th 1
uh 1
uw 1
v 1
w 1
y 1
z 1
zh 1
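For anyone building this file for a different phoneme set, the dictionary can be derived from the label file itself. This is a rough sketch under my own assumptions: `build_phoneme_dict` is a hypothetical helper, and the count column is filled with a placeholder `1` simply because that is what the file above uses.

```python
def build_phoneme_dict(label_lines):
    """Return dictionary lines of the form "<phoneme> <count>".

    The phonemes are the unique whitespace-separated tokens found in the
    label lines, sorted alphabetically like the dict.phn.txt shown above.
    """
    phonemes = sorted({p for line in label_lines for p in line.split()})
    # Placeholder count of 1, mirroring the file above (an assumption that
    # real frequencies are not needed for this setup).
    return [f"{p} 1" for p in phonemes]
```

Writing `"\n".join(build_phoneme_dict(open("train.phn")))` to `dict.phn.txt` would reproduce the layout above for whatever phonemes appear in the training labels.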
With those files, I was able to finetune the model using this train command:
fairseq-hydra-train \
distributed_training.distributed_port=1234 \
task.data=/mnt/disks/data_disk/data/librispeech/fair-test/phn \
task.labels="phn" \
model.w2v_path=/home/dzubke/fairseq/examples/wav2vec/models/libri960_big.pt \
distributed_training.distributed_world_size=1 \
+optimization.update_freq='[64]' \
checkpoint.no_epoch_checkpoints=False \
checkpoint.keep_best_checkpoints=1 \
checkpoint.keep_last_epochs=1 \
checkpoint.save_interval=10 \
common.fp16_init_scale=1 \
--config-dir /home/dzubke/fairseq/examples/wav2vec/config/finetuning \
--config-name base_1h
I initially had a hard time with the checkpoints during demo runs because the default `checkpoint.save_interval` is 1000, so I reduced that. Note the `task.data` path points to the directory where the `train.tsv` and other files are located. I'm using a single V100, which determines the `distributed_training.distributed_world_size` and `+optimization.update_freq` values.
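One way to read the relationship between those two values: gradient accumulation stands in for the GPUs that are missing, so the effective batch per update stays the same. The sketch below only illustrates that arithmetic. `update_freq_for` and its `target_gpus` parameter are hypothetical, and the value 64 is used only because it reproduces the `update_freq` in the command above with one GPU; it is not a claim about what the `base_1h` config expects.

```python
def update_freq_for(target_gpus, available_gpus):
    """Return how many batches to accumulate per update on each GPU.

    target_gpus: number of GPUs the configuration is assumed to be tuned for.
    available_gpus: number of GPUs actually used (distributed_world_size).
    """
    if available_gpus < 1:
        raise ValueError("need at least one GPU")
    if target_gpus % available_gpus != 0:
        raise ValueError("target_gpus should be a multiple of available_gpus")
    return target_gpus // available_gpus


# With one GPU and an assumed target of 64, this gives the value used above.
single_gpu = update_freq_for(64, 1)
eight_gpus = update_freq_for(64, 8)
```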
The code here shows the beginning and end of my `hydra_train.log` file. As you can see, the `train_loss` value decreases by a factor of 4 over the 1760 epochs suggesting the model is improving on the loss function.
[2021-03-29 17:13:51,370][fairseq.trainer][INFO] - begin training epoch 3
[2021-03-29 17:13:51,370][fairseq_cli.train][INFO] - Start iterating over samples
[2021-03-29 17:14:19,216][fairseq_cli.train][INFO] - end of epoch 3 (average epoch stats below)
[2021-03-29 17:14:19,217][train][INFO] - {"epoch": 3, "train_loss": "1397.73", "train_ntokens": "89218", "train_nsentences": "1307", "train_nll_loss": "20.476", "train_wps": "6380.8", "train_ups": "0.07", "train_wpb": "89218", "train_bsz": "1307", "train_num_updates": "3", "train_lr": "6.14231e-07", "train_gnorm": "647.555", "train_loss_scale": "0.125", "train_train_wall": "26", "train_gb_free": "7.5", "train_wall": "90"}
...
[2021-03-30 07:38:32,216][fairseq.trainer][INFO] - begin training epoch 1761
[2021-03-30 07:38:32,217][fairseq_cli.train][INFO] - Start iterating over samples
[2021-03-30 07:39:11,820][fairseq_cli.train][INFO] - end of epoch 1761 (average epoch stats below)
[2021-03-30 07:39:11,820][train][INFO] - {"epoch": 1761, "train_loss": "335.217", "train_ntokens": "89218", "train_nsentences": "1307", "train_nll_loss": "4.911", "train_wps": "4429.1", "train_ups": "0.05", "train_wpb": "89218", "train_bsz": "1307", "train_num_updates": "3510", "train_lr": "5e-05", "train_gnorm": "13.555", "train_loss_scale": "1", "train_train_wall": "26", "train_gb_free": "7.5", "train_wall": "51982"}
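To compare epochs without reading the raw lines, the per-epoch stats can be pulled out of the log with a few lines of Python. This is a minimal sketch based only on the line format visible above; `parse_train_stats` is a hypothetical helper, not a fairseq utility.

```python
import json
import re

# Matches the per-epoch summary lines, e.g. "...][train][INFO] - {...}".
TRAIN_LINE = re.compile(r"\[train\]\[INFO\] - (\{.*\})\s*$")


def parse_train_stats(line):
    """Return the stats dict from a summary line, or None for other lines."""
    match = TRAIN_LINE.search(line)
    if match is None:
        return None
    stats = {}
    for key, value in json.loads(match.group(1)).items():
        # Most values are logged as strings; convert them where possible.
        try:
            stats[key] = float(value)
        except (TypeError, ValueError):
            stats[key] = value
    return stats
```

Applied to the two summary lines above, `train_loss` goes from 1397.73 to 335.217 (a ratio of about 4.2), while `train_nll_loss` only falls from 20.476 to 4.911.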
I don’t fully understand how the model could decrease the training loss while also learning to predict empty tensors.
Here is my command for running evaluation:
python examples/speech_recognition/infer.py \
/mnt/disks/data_disk/data/librispeech/fair-test/phn \
--task audio_pretraining \
--nbest 1 \
--path /home/dzubke/fairseq/examples/wav2vec/outputs/2021-03-29/17-38-34/checkpoints/checkpoint_best.pt \
--gen-subset valid \
--results-path /home/dzubke/fairseq/examples/wav2vec/outputs/2021-03-29/17-38-34/results \
--w2l-decoder viterbi \
--criterion ctc \
--labels phn \
--max-tokens 4000000
Here are some excerpts from the evaluation script's output for the one-epoch checkpoint. The `~~~ drz` print statement I added shows the tokenized tensors in the `hypos` dict.
~~ drz: hypos: [{'tokens': tensor([15]), 'score': 0}]
INFO:__main__:HYPO:er
INFO:__main__:TARGET:dh ah s ah v ih l y ah n s p eh sh ah l ah s t s ih n ah dh er f iy l d z ae n d dh ah s p ey s f ao r s p iy p ah l hh uw hh ae d b ah n hh ow l d ih ng t ey p l ay n z ae n d m ey k ih ng s k eh ch ah z ae n d s n ae p ih ng k ae m er ah z w er ao l f l ay ih ng t ah l ow er s ih r t ih s t ah f ay n d aw t hh aw m ah ch aa k s ah jh ah n dh eh r w aa z ae n d hh w ah t k ay n d ah v l ay f ih t s ah p ao r t ah d
INFO:__main__:___________________
~~ drz: hypos: [{'tokens': tensor([15, 11, 15, 17, 15, 24, 15, 36, 15, 36, 15, 36, 15]), 'score': 0}]
INFO:__main__:HYPO:er ch er f er l er uh er uh er uh er
INFO:__main__:TARGET:t ow n iy l ae t ah m er dh ah d ih s k ah v er er w aa z b ih g ih n ih ng t ah k ae sh ih n aa n hh ih z ah t eh n sh ah n z t ah g l ao r iy ah ae n d hh ih z ih ng g r ey sh iy ey sh ah n w ih dh s ih d hh iy w aa z ao l w ey z ay dh er m ey k ih ng v oy s ae n d ih m ah jh t ao k s f ao r t eh l ah k ae s t ao r l ih s ah n ih ng t ah dh ah n uw z f er m dh ah hh ow m p l ae n ah t
INFO:__main__:___________________
...
INFO:__main__:WER: 97.09322935129387
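The token indices in the `hypos` tensors line up with `dict.phn.txt` once an offset of four is applied. The sketch below assumes that four special symbols occupy indices 0 to 3 ahead of the dictionary entries; that assumption is consistent with index 15 printing as `er` above, but the symbol names used here are my guess and `tokens_to_phonemes` is a hypothetical helper.

```python
# The 39 phonemes in the order they appear in dict.phn.txt.
PHONEMES = (
    "aa ae ah ao aw ay b ch d dh eh er ey f g hh ih iy jh k l m n ng "
    "ow oy p r s sh t th uh uw v w y z zh"
).split()

# Assumption: four special symbols come before the dictionary entries.
SPECIALS = ["<s>", "<pad>", "</s>", "<unk>"]
VOCAB = SPECIALS + PHONEMES


def tokens_to_phonemes(token_ids):
    """Map a sequence of token indices to a space-separated phoneme string."""
    return " ".join(VOCAB[i] for i in token_ids)
```

Under this assumption the vocabulary has 43 entries, and `tokens_to_phonemes([15, 11, 15, 17, 15, 24, 15, 36, 15, 36, 15, 36, 15])` reproduces the second `HYPO` line above.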
And here are the excerpts from the evaluation for the 1760-epoch checkpoint. Note that `hypos[0]['tokens']` is empty.
~~ drz: hypos: [{'tokens': tensor([]), 'score': 0}]
INFO:__main__:HYPO:
INFO:__main__:TARGET:dh ah s ah v ih l y ah n s p eh sh ah l ah s t s ih n ah dh er f iy l d z ae n d dh ah s p ey s f ao r s p iy p ah l hh uw hh ae d b ah n hh ow l d ih ng t ey p l ay n z ae n d m ey k ih ng s k eh ch ah z ae n d s n ae p ih ng k ae m er ah z w er ao l f l ay ih ng t ah l ow er s ih r t ih s t ah f ay n d aw t hh aw m ah ch aa k s ah jh ah n dh eh r w aa z ae n d hh w ah t k ay n d ah v l ay f ih t s ah p ao r t ah d
INFO:__main__:___________________
INFO:__main__:HYPO:
INFO:__main__:TARGET:t ow n iy l ae t ah m er dh ah d ih s k ah v er er w aa z b ih g ih n ih ng t ah k ae sh ih n aa n hh ih z ah t eh n sh ah n z t ah g l ao r iy ah ae n d hh ih z ih ng g r ey sh iy ey sh ah n w ih dh s ih d hh iy w aa z ao l w ey z ay dh er m ey k ih ng v oy s ae n d ih m ah jh t ao k s f ao r t eh l ah k ae s t ao r l ih s ah n ih ng t ah dh ah n uw z f er m dh ah hh ow m p l ae n ah t
INFO:__main__:___________________
....
INFO:__main__:WER: 100.0
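An error rate of exactly 100 is what an edit-distance metric gives when every hypothesis is empty: each reference phoneme counts as one deletion. The following is a minimal sketch of a phoneme error rate under that definition, assuming errors and reference lengths are summed over the whole set before dividing; it is not the code from `infer.py`, and both function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(pairs):
    """PER in percent over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    total = sum(len(ref.split()) for ref, _ in pairs)
    return 100.0 * errors / total
```

With this definition, any set of non-empty references paired with empty hypotheses scores 100.0, regardless of what the references contain.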
I looked inside the `encoder_input` and `emissions` and found the values were pretty small but not uniformly zero. It's hard for me to tell what reasonable values are for the `encoder_input` and `emissions`.
~~~ drz: encoder_input:
~~~ drz: source,mask shape: torch.Size([8, 261760]), torch.Size([8, 261760])
~~~ drz: source 1 sum: -0.387939453125
~~~ drz: contents: {'source': tensor([[ 3.0518e-05, 3.0518e-05, 9.1553e-05, ..., 0.0000e+00,
-1.2207e-04, -3.0518e-05],
[-6.7139e-04, -8.5449e-04, -7.6294e-04, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
...,
[-6.4087e-04, -7.3242e-04, -8.5449e-04, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00],
[-2.1362e-04, -2.1362e-04, -2.7466e-04, ..., 0.0000e+00,
0.0000e+00, 0.0000e+00]], device='cuda:0'), 'padding_mask': tensor([[False, False, False, ..., False, False, False],
[False, False, False, ..., True, True, True],
[False, False, False, ..., True, True, True],
...,
[False, False, False, ..., True, True, True]], device='cuda:0')}
~~~ drz: model has get_logits: True
~~~ drz: emissions:
~~~ drz: emissions shape: torch.Size([24, 439, 43])
~~~ drz: contents: tensor([[[ 10.6958, -7.8463, -8.0892, ..., -7.1404, -8.0253, -7.1810],
[ 10.6323, -7.7410, -7.9552, ..., -7.0427, -7.9897, -7.0889],
[ 10.6305, -7.7033, -7.9448, ..., -6.9910, -7.9354, -7.0399],
...,
[ 10.9660, -7.8253, -8.2720, ..., -7.1628, -7.9288, -7.2011],
[ 10.9660, -7.8253, -8.2720, ..., -7.1628, -7.9288, -7.2011],
[ 10.9660, -7.8253, -8.2720, ..., -7.1628, -7.9288, -7.2011]]])
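In every frame printed above, index 0 has by far the largest value (around 10.7 versus roughly -7 to -8 elsewhere). If index 0 is the CTC blank, which I believe is the case in this setup but should be treated as an assumption, then a best-path decode of such emissions is empty. The sketch below uses plain greedy CTC decoding on nested lists as a stand-in; it is not the flashlight viterbi decoder, and `ctc_greedy_decode` is a hypothetical helper.

```python
def ctc_greedy_decode(emissions, blank=0):
    """Greedy (best-path) CTC decoding for one utterance.

    emissions: list of frames, each a list of per-token scores.
    blank: index of the CTC blank token (assumed to be 0 here).
    """
    # Pick the highest-scoring token in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in emissions]

    decoded = []
    prev = None
    for idx in best:
        # Collapse repeated tokens, then drop blanks.
        if idx != prev and idx != blank:
            decoded.append(idx)
        prev = idx
    return decoded


# Three frames shaped like the ones above: blank dominates everywhere.
blank_heavy = [[10.6958, -7.8463, -8.0892], [10.6323, -7.7410, -7.9552],
               [10.9660, -7.8253, -8.2720]]
empty_result = ctc_greedy_decode(blank_heavy)
```

Here `empty_result` is an empty list, which would match the empty `tokens` tensor in the hypotheses: a model that puts nearly all of its score on blank in every frame decodes to nothing even though its emissions are not zero.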
For other people trying to do this, I found I also needed to install the flashlight python bindings as described in the README to run evaluation. I thought I wouldn't need to do this, as the README only mentions the bindings when using a language model (which I'm not using), but I found that I needed the flashlight bindings to use the viterbi decoder.
Of the flashlight dependencies, KenLM, OpenBLAS, and FFTW weren't too hard to install. Installing Intel's MKL was pretty annoying; it wasn't listed as a necessary dependency for flashlight, but I couldn't get flashlight to build without it.
I'm going to try another training run using 8 V100s to see if training longer will resolve the issue with the empty predictions.
Thanks for the help!
What’s your environment?
- fairseq Version: 0.10.2
- PyTorch Version: 1.7.1
- OS (e.g. Linux): Ubuntu 18.04
- How you installed fairseq (pip, source): pip install --editable ./
- Build command you used (if compiling from source):
- Python version: 3.8.2
- CUDA/cuDNN version: 11.2
- GPU models and configuration: Nvidia Tesla V100
- Any other relevant information: Built flashlight python bindings from source
Top GitHub Comments
I was able to approximately recreate the phoneme recognition results on the TIMIT dataset. My training and validation sets were different from those used in the research paper, as I used a train/validation split specified here.
My finetuned model achieved an 8.6 PER on my TIMIT validation set, which is very close to the 7.4 dev-PER and 8.3 test-PER values in the paper. Given the differences in the dev sets, this represents a re-creation of the phoneme recognition results for my purposes.
This is the evaluation command I ran:
And here is the end of the evaluation output
You may notice the phonemes in my evaluation are different from those in the research paper. I used a different but equivalent collapsing of the original 60 TIMIT phonemes outlined here.
Below is the training command I ran. I had to reduce `max-tokens` (similar to batch size) a few times to avoid a GPU out-of-memory error. Because of the OOM errors, I started the subsequent training runs from a checkpoint with the `checkpoint.restore_file` value. I'm not sure whether specifying both the `checkpoint.restore_file` and `model.w2v_path` values is redundant, but using both seemed to work.
Here are the final outputs from my `hydra_train.log` file
As Alexei had mentioned above, my previous `train_nll_loss` was too high at around 4. At the end of this training run, it dropped to `0.026`.
The training job ran for ~3300 epochs with 20,000 updates. These training values were unchanged from the original `base_10h.yaml` file. I only changed the `dataset.valid_subset` and `common.tensorboard_logdir` values in the modified `base_10h_2021-04-01.yaml` file. Using 8 V100s, this took roughly 24 hours to train.
Thanks, @alexeib, for the help!
As a point of clarification, I began this thread training on the 39 CMUdict phonemes in the librispeech dataset, but I ended the thread training on a different set of 39 phonemes on the TIMIT dataset.