Help with replicating the results for wav2vec-u TIMIT
What is your question?
I am trying to replicate the results for the new wav2vec-U model (https://ai.facebook.com/research/publications/unsupervised-speech-recognition); I am currently working on TIMIT. However, using the default code and scripts gives me roughly 80% UER under the "matched" setting on the 400-utterance core-dev set, before applying self-training.
(Edit 06/01/2021: I changed the `mean_pool` flag for the join segmenter to `True` and the UER improved to 71.66%, but this is still far from the reported results.)
I have listed my procedure below, along with the minor modifications I made to get the code running.
Code
N/A; see below for the modifications.
What have you tried?
Below are my questions and procedures:
- For getting TIMIT results, are `{train,valid,test}.phn` the only transcriptions needed? I followed the discussion in https://github.com/pytorch/fairseq/issues/3425 for data generation, so each line in `*.phn` matches the order of the corresponding `.tsv` file and is formatted as follows (a sanity-check sketch follows this item):

```
sil w iy l ay sil b l uw sil ch iy z sil b ah sil t v ih sil t er sil p er f er s sil w ih s sil ch iy s sil
```
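A minimal sketch of the alignment check I mean, assuming the standard fairseq manifest layout (first line is the audio root, each following line is `relative/path.wav<TAB>num_frames`); the file names are just the ones used above:

```python
# Hedged sanity check: every line of train.phn should correspond, in order,
# to an utterance line of train.tsv. Assumes the usual fairseq manifest layout.
with open("train.tsv") as f:
    root = f.readline().strip()  # first line is the audio root directory
    utt_paths = [line.split("\t")[0] for line in f if line.strip()]

with open("train.phn") as f:
    phn_lines = [line.strip() for line in f]

# The two files must have the same number of lines, in the same order.
assert len(utt_paths) == len(phn_lines), (len(utt_paths), len(phn_lines))
print(utt_paths[0], "->", phn_lines[0])
```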
- Once I installed faiss, I could run `prepare_audio.sh` without issues, using the Large (LV-60k) checkpoint. However, it seems that I do not need most of the code in `prepare_text.sh`. Below are the lines I've kept (a note on the `dict.phn.txt` format follows this item):

```sh
python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/train.phn --workers 16 --only-source --destdir $target_dir --srcdict $target_dir/dict.phn.txt
lmplz -o 4 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.04.arpa
build_binary $target_dir/lm.phones.filtered.04.arpa $target_dir/lm.phones.filtered.04.bin
lmplz -o 6 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.06.arpa
build_binary $target_dir/lm.phones.filtered.06.arpa $target_dir/lm.phones.filtered.06.bin
lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_phn_sil lm_arpa=$target_dir/lm.phones.filtered.06.arpa data_dir=$target_dir "blank_symbol='sil'"
```
I had to add `-S 10%` because, for some reason, KenLM threw a malloc OOM error without it. I also could not get the line invoking `kaldi_initializer.py` to run, as it threw the following error:

```
Traceback (most recent call last):
  File "/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
    initalize_kaldi(cfg)
  File "/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 616, in initalize_kaldi
    cfg.out_labels = cfg.in_labels
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: in_labels
    full_key: in_labels
    reference_type=Optional[Dict[Union[str, Enum], Any]]
    object_type=dict
```
As I understand it, the Kaldi initializer is not used for GAN training itself, so I moved on.
(Edit 06/03/2021: I have been able to get `kaldi_initializer.py` to run by passing in the extra arguments `kaldi_root=/path/to/kaldi` and `in_labels=phn`.)
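As a note on the `preprocess.py` line kept above: `--srcdict` expects `dict.phn.txt` in the standard fairseq dictionary format, one symbol and one count per line. The phones and counts in this sample are purely illustrative:

```
sil 18294
iy 2710
ah 2292
```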
- I then put the code under `data`, `models`, and `tasks` into fairseq's corresponding directories and changed `__init__.py` under those directories where necessary. I modified a few things to launch GAN training on top of the preprocessed features:
First, some of the class members are not constructed correctly in `wav2vec_u.py`:

```python
# original:
self.discriminator = self.Discriminator(output_size, cfg)
# changed to:
self.discriminator = Discriminator(output_size, cfg)

# original:
self.generator = self.Generator(d, output_size, cfg, lambda x: self.normalize(x)[0])
# changed to:
self.generator = Generator(d, output_size, cfg)
```

Three further class members are referenced but never defined, so I had to add them (not sure if these defaults are correct):

```python
self.zero_pretrain_updates = 0
self.exponential_code_pen = False
self.dynamic_step_thresh = 0
```
I have also modified `wav2vec_u.py` and `unpaired_audio_text.py` so that all relevant hardcoded occurrences of `<SIL>` are changed to `sil`. (I probably should have replaced TIMIT's `sil` with `<SIL>` in the data beforehand, but either way should work; see the sketch after this paragraph.)
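For completeness, a minimal sketch of that alternative, remapping TIMIT's `sil` to `<SIL>` in the `*.phn` files; the file names are assumed from the setup above:

```python
# Hedged sketch: rewrite {train,valid,test}.phn so that TIMIT's "sil" token
# matches the "<SIL>" symbol expected by the unmodified code.
for split in ["train", "valid", "test"]:
    path = f"{split}.phn"
    with open(path) as f:
        lines = [line.split() for line in f]  # read everything before rewriting
    with open(path, "w") as f:
        for toks in lines:
            f.write(" ".join("<SIL>" if t == "sil" else t for t in toks) + "\n")
```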
(Edit 06/03/2021: I read the code in `wav2vec_u.py`, and it seems that in the function `valid_step`, silences are removed with the line `x = x[x != self.sil_id]`; but in `prepare_text.sh`, the phone LM is built with silences. What is the rationale behind this?)
I used the default hyper-parameters provided in `config/gan/w2vu.yaml` for training the model, but the script only saved `checkpoint_best.pt` and `checkpoint_last.pt` (because `no_epoch_checkpoints` is set to true in the config file), selecting the best checkpoint by `weighted_lm_ppl`. This seems to be the "vocabulary-usage adjusted entropy" mentioned on page 14 of the paper, except for a `vocab_usage_power=2` hardcoded in `unpaired_audio_text.py`. I only used `checkpoint_best.pt` for the later steps, and did not train/validate other model configurations. (My reconstruction of the metric follows this paragraph.)
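My reading of `unpaired_audio_text.py` is roughly the following; the exact expression is a reconstruction on my part (treat it as an assumption), but it reproduces the `valid_lm_ppl` and `valid_weighted_lm_ppl` values in the training log quoted further down:

```python
import math

# Values taken from the validation log quoted later in this issue.
lm_score_sum = -31856.94988822937  # summed KenLM log10 scores of hypotheses
num_pred_chars = 13425.0           # number of predicted phones
nsentences = 400                   # one EOS token per sentence
vocab_seen_pct = 1.0               # fraction of the phone vocabulary produced
vocab_usage_power = 2              # hardcoded in unpaired_audio_text.py

# Assumed formula: perplexity from the average log10 LM score, penalized by
# how little of the vocabulary the generator actually uses.
lm_ppl = math.pow(10, -lm_score_sum / (num_pred_chars + nsentences))
weighted_lm_ppl = lm_ppl / (vocab_seen_pct ** vocab_usage_power)
print(lm_ppl, weighted_lm_ppl)  # both ~201.512, matching the log
```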
- I then invoke `w2vu_generate.py` as follows:

```sh
python w2vu_generate.py --config-dir config/generate --config-name viterbi \
    fairseq.task.data=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/ \
    fairseq.common_eval.path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/multirun/2021-05-27/04-18-34/0/checkpoint_last.pt \
    lm_model=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/lm.phones.filtered.04.bin \
    fairseq.dataset.gen_subset=valid results_path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/

python scripts/wer.py -s $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid.txt -r $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid_ref.txt
```
It seems that the `W2lViterbiDecoder` selected by the default `config/generate/viterbi.yaml` requires an additional `criterion` argument, so I hardcoded it to `ctc`:

```python
criterion: Optional[str] = field(default="ctc", metadata={"help": "VITERBI criterion?"})
```

With that in place, the `wer.py` script reports the aforementioned 71.66% UER.
Any idea what needs to be changed to get close to the reported PER?
(Edit 06/01/2021)
- I have also noticed something I don't understand in the logging. First, it says:
```
[2021-05-31 15:02:13,185][fairseq.data.extracted_features_dataset][INFO] - loaded 3696, skipped 0 samples
[2021-05-31 15:02:13,185][fairseq.tasks.unpaired_audio_text][INFO] - split train has unpaired text? True
[2021-05-31 15:02:13,228][fairseq.data.data_utils][INFO] - loaded 3,696 examples from: /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/train
[2021-05-31 15:02:17,351][fairseq.trainer][INFO] - NOTE: your device may support faster training with --fp16
[2021-05-31 15:02:17,611][fairseq.trainer][INFO] - begin training epoch 1
[2021-05-31 15:02:17,612][fairseq_cli.train][INFO] - Start iterating over samples
[2021-05-31 15:02:41,611][root][INFO] - Reducer buckets have been rebuilt in this iteration.
[2021-05-31 15:02:43,533][fairseq_cli.train][INFO] - begin validation on "valid" subset
[2021-05-31 15:02:58,392][valid][INFO] - {"epoch": 1, "valid_loss": "0.927", "valid_ntokens": "15334", "valid_nsentences": "400", "valid_lm_score_sum": -31856.94988822937, "valid_num_pred_chars": 13425.0, "valid_vocab_seen_pct": "1", "valid_uer": 92.72205556280163, "valid_weighted_lm_ppl": "201.512", "valid_lm_ppl": "201.512", "valid_wps": "0", "valid_wpb": "15334", "valid_bsz": "400", "valid_num_updates": "6"}
[2021-05-31 15:02:58,396][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 1 @ 6 updates
[2021-05-31 15:02:58,398][fairseq.trainer][INFO] - Saving checkpoint to ./checkpoint_best.pt
[2021-05-31 15:02:58,477][fairseq.trainer][INFO] - Finished saving checkpoint to ./checkpoint_best.pt
[2021-05-31 15:02:58,534][fairseq.checkpoint_utils][INFO] - Saved checkpoint ./checkpoint_best.pt (epoch 1 @ 6 updates, score 201.51165633927232) (writing took 0.1390704633668065 seconds)
[2021-05-31 15:02:58,534][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
[2021-05-31 15:02:58,541][train][INFO] - {"epoch": 1, "train_loss": "41.484", "train_ntokens": "616", "train_nsentences": "616", "train_temp": "2", "train_code_ppl": "14.041", "train_loss_code_pen": "0.044", "train_loss_smoothness": "15.845", "train_loss_dense_g": "0.695", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "63.134", "train_loss_dense_d": "0.691", "train_loss_token_d": "0.693", "train_wps": "180.5", "train_ups": "0.3", "train_wpb": "616", "train_bsz": "616", "train_num_updates": "6", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "34.87", "train_clip": "83.3", "train_train_wall": "10", "train_gb_free": "31.1", "train_wall": "45"}
```
Why does the log say `"train_ntokens": "616", "train_nsentences": "616"`, with a single epoch finishing in 6 updates, even though the TIMIT train set has 3,696 examples and the batch size in the config file is set to 160? (I do notice that 616 × 6 = 3,696, so each update appears to consume about 616 examples on average; perhaps the configured batch size is applied per GPU.)
- Although the UER is not great, for the purpose of getting the code to run, I have also tried running `train.sh` within the `kaldi_self_train` directory. For the w2v features, which set should the script use? Should I use the segment-level, mean-pooled features as used by the GAN? If I do, Kaldi throws the error `utterance has too few frames to align`. I could only start Kaldi training with the features prepared in `precompute_pca512`, instead of those in `precompute_pca512_cls128_mean_pooled`.
(Edit 06/03/2021: While I could get Kaldi started with the features in `precompute_pca512`, the script got stuck at the following line within `train_subset_lgbeam.sh`.)

```sh
steps/decode.sh --nj 20 --cmd "$decode_cmd" \
    $exp_root/mono/graph $data/$valid $exp_root/mono/decode_$valid &
```
Thanks very much for the help!
What’s your environment?
- fairseq Version: master
- PyTorch Version: 1.7.1
- OS (e.g., Linux): Linux
- How you installed fairseq (`pip`, source): source
- Build command you used (if compiling from source): `pip install --editable ./`
- Python version: 3.7.9
- CUDA/cuDNN version: 10.2
- GPU models and configuration: Tesla V100-SXM2-32GB
- Any other relevant information:
Top GitHub Comments
I got it. You can change `@register_task("gan_audio_pretraining_feats", dataclass=UnpairedAudioTextConfig)` in `tasks/unpaired_audio_text.py` to `@register_task("unpaired_audio_text", dataclass=UnpairedAudioTextConfig)`.
@JeromeNi @JINGZIjingzi @shiva1393 Can you please explain a little bit why we should copy the code files under `{fairseq_root}/fairseq/`? I don't understand.
Thanks in advance for your help!