
Help with replicating the results for wav2vec-u TIMIT

See original GitHub issue

What is your question?

I am trying to replicate the results for the new wav2vec-u (https://ai.facebook.com/research/publications/unsupervised-speech-recognition) model, currently working on TIMIT. However, using the default code and scripts gives me roughly 80% UER under the “matched” setting for the 400-utterance core-dev set, before applying self-training.

(Edit 06/01/2021: I changed the ‘mean_pool’ flag for the join segmenter to ‘True’ and the UER improved to 71.66%, but that is still far from the reported results.)

I have listed my procedures below and some minor modifications to get the code running.

Code

N/A; see below for the modifications.

What have you tried?

Below are my questions and procedures:

  1. For getting TIMIT results, is {train,valid,test}.phn the only set of transcriptions needed? I followed the discussion here (https://github.com/pytorch/fairseq/issues/3425) for data generation, where each line in *.phn matches the order of the corresponding .tsv manifest and is formatted as follows (a quick alignment check is sketched right after this item):

    sil w iy l ay sil b l uw sil ch iy z sil b ah sil t v ih sil t er sil p er f er s sil w ih s sil ch iy s sil
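For what it's worth, here is a minimal sanity check that each transcript file lines up one-to-one with its manifest. The data_dir path is hypothetical, and it assumes the usual fairseq wav2vec .tsv layout (a root-directory header line followed by one audio file per line).

```python
from pathlib import Path

# Hypothetical location of the prepared TIMIT manifests and transcripts.
data_dir = Path("files/timit")

for split in ("train", "valid", "test"):
    # .tsv manifests: first line is the audio root dir, then one file per line.
    n_audio = len((data_dir / f"{split}.tsv").read_text().splitlines()) - 1
    n_text = len((data_dir / f"{split}.phn").read_text().splitlines())
    assert n_audio == n_text, f"{split}: {n_audio} audio rows vs {n_text} transcript lines"
```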
  2. Once I installed faiss, I could run prepare_audio.sh without issues, using the Large (LV-60k) checkpoint. However, it seems that I do not need most of the code in prepare_text.sh. Below are the lines I kept:

    python $FAIRSEQ_ROOT/fairseq_cli/preprocess.py --dataset-impl mmap --trainpref $target_dir/train.phn --workers 16 --only-source --destdir $target_dir --srcdict $target_dir/dict.phn.txt
    lmplz -o 4 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.04.arpa
    build_binary $target_dir/lm.phones.filtered.04.arpa $target_dir/lm.phones.filtered.04.bin
    lmplz -o 6 -S 10% < $target_dir/train.phn --discount_fallback >! $target_dir/lm.phones.filtered.06.arpa
    build_binary $target_dir/lm.phones.filtered.06.arpa $target_dir/lm.phones.filtered.06.bin
    lg=$lg python $FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py fst_dir=$target_dir/fst/phn_to_phn_sil lm_arpa=$target_dir/lm.phones.filtered.06.arpa data_dir=$target_dir "blank_symbol='sil'"

I had to add -S 10% because kenlm otherwise threw a malloc out-of-memory error. I also could not get the line invoking kaldi_initializer.py to run, as it threw the following error:

    Traceback (most recent call last):
      File "/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 677, in cli_main
        initalize_kaldi(cfg)
      File "/nobackup/users/junruin2/fairseq/examples/speech_recognition/kaldi/kaldi_initializer.py", line 616, in initalize_kaldi
        cfg.out_labels = cfg.in_labels
    omegaconf.errors.MissingMandatoryValue: Missing mandatory value: in_labels
        full_key: in_labels
        reference_type=Optional[Dict[Union[str, Enum], Any]]
        object_type=dict

As I understand it, the kaldi_initializer is not used for GAN training itself, so I moved onwards.

(Edit 06/03/2021: I got kaldi_initializer.py to run by passing the extra arguments kaldi_root=/path/to/kaldi and in_labels=phn.)
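As a side note, the phone LMs built above can be sanity-checked from Python. This is only a sketch: it assumes the kenlm Python bindings are installed and that the path below points at $target_dir, neither of which is part of the original scripts.

```python
import kenlm  # pip install https://github.com/kpu/kenlm/archive/master.zip

# Hypothetical path; adjust to $target_dir from prepare_text.sh.
lm = kenlm.Model("lm.phones.filtered.04.bin")

# A phone sequence in the same format as train.phn.
sample = "sil w iy l ay sil"
print("log10 prob:", lm.score(sample))
print("perplexity:", lm.perplexity(sample))
```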

  3. I then put the code under data, models, and tasks into fairseq's corresponding directories and changed __init__.py under those directories where necessary. I modified a few things to launch GAN training on top of the preprocessed features:

First, some of the class members are not defined correctly in wav2vec_u.py:

    # original:
    self.discriminator = self.Discriminator(output_size, cfg)
    # changed to:
    self.discriminator = Discriminator(output_size, cfg)

    # original:
    self.generator = self.Generator(d, output_size, cfg, lambda x: self.normalize(x)[0])
    # changed to:
    self.generator = Generator(d, output_size, cfg)

    # added:
    self.zero_pretrain_updates = 0
    self.exponential_code_pen = False
    self.dynamic_step_thresh = 0

The last three class members are referenced but not defined anywhere in the code, so I had to add them; I am not sure whether I set them correctly.
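If anyone wants these to be tunable rather than hardcoded, one option is to expose them as config fields. The sketch below is only my own guess: the field names come from the issue, but the config class name, the defaults, and the help strings are assumptions, not the official definitions.

```python
from dataclasses import dataclass, field
from fairseq.dataclass import FairseqDataclass


@dataclass
class Wav2vec_UConfig(FairseqDataclass):  # class name assumed; merge into the existing model config
    # ... existing fields ...
    zero_pretrain_updates: int = field(
        default=0, metadata={"help": "assumed: updates to run before certain losses activate"}
    )
    exponential_code_pen: bool = field(
        default=False, metadata={"help": "assumed: use an exponential form of the code penalty"}
    )
    dynamic_step_thresh: int = field(
        default=0, metadata={"help": "assumed: threshold controlling dynamic discriminator steps"}
    )


# and in Wav2vec_U.__init__, read them from cfg instead of hardcoding:
#     self.zero_pretrain_updates = cfg.zero_pretrain_updates
#     self.exponential_code_pen = cfg.exponential_code_pen
#     self.dynamic_step_thresh = cfg.dynamic_step_thresh
```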

I have also modified wav2vec_u.py and unpaired_audio_text.py so that all relevant hard-coded occurrences of ‘<SIL>’ are changed to ‘sil’. (I probably should have replaced the TIMIT sil label with <SIL> in the data beforehand, but either way should work.)
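For anyone who prefers the data-side fix mentioned above (mapping the TIMIT sil label to <SIL> instead of patching the code), here is a minimal sketch; the directory path is hypothetical.

```python
from pathlib import Path

# Hypothetical location of the *.phn transcripts produced earlier.
target_dir = Path("files/timit_text")

for split in ("train", "valid", "test"):
    phn = target_dir / f"{split}.phn"
    lines = phn.read_text().splitlines()
    mapped = [" ".join("<SIL>" if tok == "sil" else tok for tok in line.split()) for line in lines]
    phn.write_text("\n".join(mapped) + "\n")

# Note: dict.phn.txt and the phone LMs would also need to be rebuilt with the new symbol.
```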

(Edit 06/03/2021: I read the code in wav2vec_u.py and it seems that in valid_step, silences are removed with the line x = x[x != self.sil_id], whereas in prepare_text.sh the phone LM is built with silences. What is the rationale behind this?)
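For context, that line is plain boolean indexing on the predicted id tensor; a toy illustration follows (the id values here are invented and do not correspond to the real dict.phn.txt).

```python
import torch

sil_id = 4
# Predicted phone ids with silences interspersed (values made up for illustration).
x = torch.tensor([4, 12, 7, 4, 9, 4])
x = x[x != sil_id]
print(x)  # tensor([12, 7, 9]) -- silences dropped before scoring
```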

I used the default hyper-parameters provided in config/gan/w2vu.yaml for training the model, but the script only saved checkpoint_best.pt and checkpoint_last.pt (because no_epoch_checkpoints is set to true in the config file), with the best checkpoint selected by weighted_lm_ppl, which appears to be the “vocabulary-usage adjusted entropy” mentioned on page 14 of the paper, except for a vocab_usage_power=2 hardcoded in unpaired_audio_text.py. I only used checkpoint_best.pt for the later steps, and did not train/validate other model configurations.

  4. I then invoke w2vu_generate.py as follows:

    python w2vu_generate.py --config-dir config/generate --config-name viterbi \
        fairseq.task.data=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/ \
        fairseq.common_eval.path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/multirun/2021-05-27/04-18-34/0/checkpoint_last.pt \
        lm_model=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/lm.phones.filtered.04.bin \
        fairseq.dataset.gen_subset=valid results_path=$FAIRSEQ_ROOT/examples/wav2vec/unsupervised/file/wav2vec_transcriptions/

    python scripts/wer.py -s $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid.txt -r $FAIRSEQ_ROOT/examples/wav2vec/unsupervised/files/wav2vec_transcriptions/valid_ref.txt

It seems that the W2lViterbiDecoder selected by the default config/generate/viterbi.yaml requires an additional criterion argument, so I hardcoded it to ctc:

    criterion: Optional[str] = field(default="ctc", metadata={"help": "VITERBI criterion?"},)

The wer.py script then reports the aforementioned 71.66% UER.
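For reference, a unit error rate of this kind is just a token-level edit distance summed over utterance pairs. The sketch below is only in the spirit of scripts/wer.py, not the actual script, and assumes the editdistance package plus the valid.txt / valid_ref.txt files produced above.

```python
import editdistance  # pip install editdistance

def uer(hyp_lines, ref_lines):
    errors, ref_tokens = 0, 0
    for hyp, ref in zip(hyp_lines, ref_lines):
        h, r = hyp.split(), ref.split()
        errors += editdistance.eval(h, r)  # substitutions + insertions + deletions
        ref_tokens += len(r)
    return 100.0 * errors / max(ref_tokens, 1)

with open("valid.txt") as hyp_f, open("valid_ref.txt") as ref_f:
    print(f"UER: {uer(hyp_f.readlines(), ref_f.readlines()):.2f}%")
```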

Any idea what needs to be changed to get close to the PER reported?

(Edit 06/01/2021)

  5. I have also noticed something I don't understand in the logging. First, it says:

    [2021-05-31 15:02:13,185][fairseq.data.extracted_features_dataset][INFO] - loaded 3696, skipped 0 samples
    [2021-05-31 15:02:13,185][fairseq.tasks.unpaired_audio_text][INFO] - split train has unpaired text? True
    [2021-05-31 15:02:13,228][fairseq.data.data_utils][INFO] - loaded 3,696 examples from: /nobackup/users/junruin2/fairseq/examples/wav2vec/unsupervised/files/wav2vec_out/precompute_pca512_cls128_mean_pooled/train
    [2021-05-31 15:02:17,351][fairseq.trainer][INFO] - NOTE: your device may support faster training with --fp16
    [2021-05-31 15:02:17,611][fairseq.trainer][INFO] - begin training epoch 1
    [2021-05-31 15:02:17,612][fairseq_cli.train][INFO] - Start iterating over samples
    [2021-05-31 15:02:41,611][root][INFO] - Reducer buckets have been rebuilt in this iteration.
    [2021-05-31 15:02:43,533][fairseq_cli.train][INFO] - begin validation on "valid" subset
    [2021-05-31 15:02:58,392][valid][INFO] - {"epoch": 1, "valid_loss": "0.927", "valid_ntokens": "15334", "valid_nsentences": "400", "valid_lm_score_sum": -31856.94988822937, "valid_num_pred_chars": 13425.0, "valid_vocab_seen_pct": "1", "valid_uer": 92.72205556280163, "valid_weighted_lm_ppl": "201.512", "valid_lm_ppl": "201.512", "valid_wps": "0", "valid_wpb": "15334", "valid_bsz": "400", "valid_num_updates": "6"}
    [2021-05-31 15:02:58,396][fairseq.checkpoint_utils][INFO] - Preparing to save checkpoint for epoch 1 @ 6 updates
    [2021-05-31 15:02:58,398][fairseq.trainer][INFO] - Saving checkpoint to ./checkpoint_best.pt
    [2021-05-31 15:02:58,477][fairseq.trainer][INFO] - Finished saving checkpoint to ./checkpoint_best.pt
    [2021-05-31 15:02:58,534][fairseq.checkpoint_utils][INFO] - Saved checkpoint ./checkpoint_best.pt (epoch 1 @ 6 updates, score 201.51165633927232) (writing took 0.1390704633668065 seconds)
    [2021-05-31 15:02:58,534][fairseq_cli.train][INFO] - end of epoch 1 (average epoch stats below)
    [2021-05-31 15:02:58,541][train][INFO] - {"epoch": 1, "train_loss": "41.484", "train_ntokens": "616", "train_nsentences": "616", "train_temp": "2", "train_code_ppl": "14.041", "train_loss_code_pen": "0.044", "train_loss_smoothness": "15.845", "train_loss_dense_g": "0.695", "train_lm_score_sum": 0.0, "train_num_pred_chars": 0.0, "train_loss_grad_pen": "63.134", "train_loss_dense_d": "0.691", "train_loss_token_d": "0.693", "train_wps": "180.5", "train_ups": "0.3", "train_wpb": "616", "train_bsz": "616", "train_num_updates": "6", "train_lr_discriminator": "0.0005", "train_lr_generator": "0.0004", "train_gnorm": "34.87", "train_clip": "83.3", "train_train_wall": "10", "train_gb_free": "31.1", "train_wall": "45"}

Why does the log report "train_ntokens": "616" and "train_nsentences": "616", and why does a single epoch finish in 6 updates, even though the TIMIT train set has 3696 examples and the batch size in the config file is set to 160?
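A quick back-of-the-envelope check on those logged numbers: 6 updates of ~616 sentences each cover exactly the 3696 training examples, so each update is consuming roughly 616 sentences rather than 160. My own guess (an assumption, not an authoritative explanation) is that batching is being driven by a max-tokens/frames limit, multiple workers, or gradient accumulation rather than by the sentence-level batch size alone.

```python
# Just arithmetic on the numbers from the log above; no fairseq involved.
train_examples = 3696        # "loaded 3696, skipped 0 samples"
sentences_per_update = 616   # train_nsentences / train_bsz in the epoch-1 stats
updates_per_epoch = 6        # train_num_updates at the end of epoch 1

print(sentences_per_update * updates_per_epoch)  # 3696 -> exactly one pass over the train split
print(train_examples / 160)                      # ~23.1 updates expected if each batch held 160 sentences
```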

  6. Although the UER is not great, for the purpose of getting the code to run I have also tried running train.sh in the kaldi_self_train directory. Which set of w2v features should the script use? Should it be the segment-level, mean-pooled features used by the GAN? If so, Kaldi throws the error that the utterance has too few frames to align. I could only start Kaldi training with the features prepared in precompute_pca512, not those in precompute_pca512_cls128_mean_pooled.

(Edit 06/03/2021: While I could get Kaldi started with the features in precompute_pca512, the script got stuck at steps/decode.sh --nj 20 --cmd "$decode_cmd" $exp_root/mono/graph $data/$valid $exp_root/mono/decode_$valid & within train_subset_lgbeam.sh.)
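As a side note on the "too few frames to align" error above: if the feature directories follow the <split>.npy plus <split>.lengths layout that the ExtractedFeaturesDataset log suggests (an assumption on my part), a quick look at the length files shows how short the mean-pooled segment-level sequences are compared to the frame-level precompute_pca512 ones.

```python
import numpy as np
from pathlib import Path

# Hypothetical output root from prepare_audio.sh; adjust to your setup.
out_root = Path("files/wav2vec_out")

for feat_dir in ("precompute_pca512", "precompute_pca512_cls128_mean_pooled"):
    lengths_file = out_root / feat_dir / "train.lengths"
    if lengths_file.exists():
        lengths = np.loadtxt(lengths_file, dtype=int)
        print(f"{feat_dir}: min={lengths.min()}, median={int(np.median(lengths))} frames per utterance")
```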

Thanks very much for the help!

What’s your environment?

  • fairseq Version: master
  • PyTorch Version: 1.7.1
  • OS (e.g., Linux): Linux
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source): pip install --editable ./
  • Python version: 3.7.9
  • CUDA/cuDNN version: 10.2
  • GPU models and configuration: Tesla V100-SXM2-32GB
  • Any other relevant information:


Top GitHub Comments

1 reaction
JINGZIjingzi commented, Jun 6, 2021

I got it. You can change @register_task("gan_audio_pretraining_feats", dataclass=UnpairedAudioTextConfig) in tasks/unpaired_audio_text.py to @register_task("unpaired_audio_text", dataclass=UnpairedAudioTextConfig).

0 reactions
Ning107 commented, Jul 12, 2021

@JeromeNi @JINGZIjingzi @shiva1393 Can you please explain a little why we should copy the code files under {fairseq_root}/fairseq/? I don't understand.

Thanks in advance for your help!

Read more comments on GitHub >

