Errors running prepare_text.sh (and other preprocessing) from wav2vec-u in fresh environment
My Question:
How can I get prepare_text.sh running correctly in a fresh Ubuntu Jupyterlab environment? What needs to be installed, what variables set, etc.?
I’ve run into various issues attempting to run the script prepare_text.sh, from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/unsupervised/scripts/prepare_text.sh.
Right now, I’m stuck on "preprocess.py: error: unrecognized arguments: --dict-only", but I’ve run into several other errors that I’ve had to work around, detailed below.
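On the --dict-only error specifically: my suspicion (not confirmed) is that the scripts in the cloned master repo expect a newer preprocess.py than the pip-installed fairseq 0.10.0 provides, and Python is importing the pip copy rather than my clone. A quick diagnostic sketch of my own, not part of the scripts, to see which fairseq Python actually resolves:

```shell
# Print where Python resolves "fairseq" from. If this shows a site-packages
# path (the pip-installed 0.10.0) instead of the cloned repo, the scripts
# will run the old preprocess.py that lacks --dict-only; installing the
# clone with "pip install -e $FAIRSEQ_ROOT" would be one way to fix that.
python - <<'EOF'
import importlib.util
spec = importlib.util.find_spec("fairseq")
print(spec.origin if spec else "fairseq not importable")
EOF
```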
Full current output:
After getting through all the other issues I detail below, currently this is what I see when I attempt to run the script.
I cloned the https://github.com/pytorch/fairseq.git repo, and navigated to the scripts folder: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts before running this.
(wav2vecu_pre) jovyan@user-ofmghcmafhv-jtfbeefyexclusive-0:~/work/fairseq/examples/wav2vec/unsupervised/scripts$ zsh prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
[--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
[--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
[--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
[--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
[--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
[--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
cut: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/dict.txt: No such file or directory
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
one is
sed: can't read /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
paste: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
[--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
[--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
[--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
[--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
[--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
[--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
2021-06-03 16:39:42 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/lm.phones.filtered.txt', validpref=None, testpref=None, align_suffix=None, destdir='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/dict.phn.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=70)
Traceback (most recent call last):
File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 401, in <module>
cli_main()
File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 397, in cli_main
main(args)
File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 98, in main
src_dict = task.load_dictionary(args.srcdict)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/tasks/fairseq_task.py", line 54, in load_dictionary
return Dictionary.load(filename)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 214, in load
d.add_from_file(f)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 225, in add_from_file
self.add_from_file(fd)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 249, in add_from_file
raise RuntimeError(
RuntimeError: Duplicate word found when loading Dictionary: '<SIL>'. Duplicate words can overwrite earlier ones by adding the #fairseq:overwrite flag at the end of the corresponding row in the dictionary file. If using the Camembert model, please download an updated copy of the model file.
prepare_text.sh:49: command not found: lmplz
prepare_text.sh:50: command not found: build_binary
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
prepare_text.sh:54: command not found: lmplz
prepare_text.sh:55: command not found: build_binary
prepare_text.sh:56: command not found: lmplz
prepare_text.sh:57: command not found: build_binary
Primary config directory not found.
Check that the config directory '/home/jovyan/work/fairseq/examples/speech_recognition/kaldi/config' exists and readable
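For the "Duplicate word found when loading Dictionary: '<SIL>'" RuntimeError in the log above, the message itself suggests appending the #fairseq:overwrite flag. An alternative workaround I sketched (untested against a full run; back up the file first, and adjust the path to your output directory) is to drop repeated symbols from the generated dictionary, keeping the first occurrence of each:

```shell
# Deduplicate the phone dictionary by its first field (the symbol),
# keeping the first line seen for each symbol.
dict=/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/dict.phn.txt
cp "$dict" "$dict.bak"
awk '!seen[$1]++' "$dict.bak" > "$dict"
```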
Fixed (?) Problem: Can’t seem to run it from the same folder as the README (workaround: run from scripts folder)
First, I can’t run it from the folder the README at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data says to run it from. If you try, you get path-not-found errors for the other scripts it calls, e.g.:
zsh scripts/prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/normalize_and_filter_text.py': [Errno 2] No such file or directory
Fixed (?) Problem: “ValueError: lid.187.bin cannot be opened for loading!” (workaround: use lid.176.bin instead)
Solution: download a different language ID model, and edit the code to use it.
https://fasttext.cc/docs/en/language-identification.html has a different model, lid.176.bin
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
and edit this portion of normalize_and_filter_text.py:
parser.add_argument(
    "--fasttext-model",
    help="path to fasttext model",
    default="lid.176.bin",
)
Fixed (?) Problem: dependencies needed (phonemizer, fasttext, fairseq)
The script does not list its dependencies. So far I’ve determined that phonemizer and fasttext are needed, and I think fairseq too. Are there any more I’m missing?
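A quick way to check which of these are importable in the current environment (my own sketch; the package list is just my best guess so far):

```shell
# Report which suspected dependencies are missing from the active env.
for pkg in fairseq fasttext phonemizer; do
  python -c "import $pkg" 2>/dev/null || echo "missing: $pkg"
done
```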
Fixed (?) Problem: can’t find files in fairseq_cli (solution: you need to set an environment variable, FAIRSEQ_ROOT).
I set this to point to the top level of the cloned repo (I cloned it to ~/work/fairseq/); not sure if that’s right.
export FAIRSEQ_ROOT=~/work/fairseq/
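A sanity check I could have used (my own sketch): verify that the files the script references actually exist under FAIRSEQ_ROOT, which they should if it points at the repo top level:

```shell
export FAIRSEQ_ROOT=~/work/fairseq
# prepare_text.sh reaches these via $FAIRSEQ_ROOT, so both must resolve.
for f in "$FAIRSEQ_ROOT/fairseq_cli/preprocess.py" \
         "$FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py"; do
  [ -f "$f" ] && echo "ok: $f" || echo "missing: $f"
done
```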
Fixed (?) Problem: Not sure what language code to use (guessed sw).
I’ve got Swahili data. Not sure whether to use sw, or swahili, or what; I assume I should pick from https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
Code
Here’s the command I use to invoke the script. Other than editing the default langid model, I haven’t changed anything in the repo; it should be the same as https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts (git log shows commit c47a9b2eef0f41b0564c8daf52cb82ea97fc6548).
zsh prepare_text.sh language /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
What have you tried?
- Tried reading https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data
- Tried reading https://github.com/pytorch/fairseq/issues/3581 and https://github.com/pytorch/fairseq/issues/3586
- Googling for various keywords such as “fairseq preprocess dict-only”
What’s your environment?
I’m in a Jupyterlab in a Docker container, running Ubuntu.
OS is Ubuntu 20.04.2:
cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
pip list:
Package                Version
---------------------- -------------------
antlr4-python3-runtime 4.8
attrs 21.2.0
certifi 2021.5.30
cffi 1.14.5
clldutils 3.9.0
colorlog 5.0.1
csvw 1.11.0
Cython 0.29.23
dataclasses 0.6
editdistance 0.5.3
fairseq 0.10.0
fasttext 0.9.2
hydra-core 1.0.6
isodate 0.6.0
joblib 1.0.1
numpy 1.20.3
omegaconf 2.0.6
phonemizer 2.2.2
pip 21.1.2
portalocker 2.0.0
pybind11 2.6.2
pycparser 2.20
python-dateutil 2.8.1
PyYAML 5.4.1
regex 2021.4.4
rfc3986 1.5.0
sacrebleu 1.5.1
segments 2.2.0
setuptools 49.6.0.post20210108
six 1.16.0
tabulate 0.8.9
torch 1.8.1
tqdm 4.61.0
typing-extensions 3.10.0.0
uritemplate 3.0.1
wheel 0.36.2
conda list:
# packages in environment at /opt/conda/envs/wav2vecu_pre:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
antlr4-python3-runtime 4.8 pypi_0 pypi
attrs 21.2.0 pypi_0 pypi
ca-certificates 2021.5.30 ha878542_0 conda-forge
certifi 2021.5.30 py39hf3d152e_0 conda-forge
cffi 1.14.5 pypi_0 pypi
clldutils 3.9.0 pypi_0 pypi
colorlog 5.0.1 pypi_0 pypi
csvw 1.11.0 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
dataclasses 0.6 pypi_0 pypi
editdistance 0.5.3 pypi_0 pypi
fairseq 0.10.0 pypi_0 pypi
fasttext 0.9.2 pypi_0 pypi
hydra-core 1.0.6 pypi_0 pypi
isodate 0.6.0 pypi_0 pypi
joblib 1.0.1 pypi_0 pypi
ld_impl_linux-64 2.35.1 hea4e1c9_2 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_19 conda-forge
libgomp 9.3.0 h2828fa1_19 conda-forge
libstdcxx-ng 9.3.0 h6de172a_19 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
numpy 1.20.3 pypi_0 pypi
omegaconf 2.0.6 pypi_0 pypi
openssl 1.1.1k h7f98852_0 conda-forge
phonemizer 2.2.2 pypi_0 pypi
pip 21.1.2 pyhd8ed1ab_0 conda-forge
portalocker 2.0.0 pypi_0 pypi
pybind11 2.6.2 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
python 3.9.4 hffdb5ce_0_cpython conda-forge
python-dateutil 2.8.1 pypi_0 pypi
python_abi 3.9 1_cp39 conda-forge
pyyaml 5.4.1 pypi_0 pypi
readline 8.1 h46c0cb4_0 conda-forge
regex 2021.4.4 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
sacrebleu 1.5.1 pypi_0 pypi
segments 2.2.0 pypi_0 pypi
setuptools 49.6.0 py39hf3d152e_3 conda-forge
six 1.16.0 pypi_0 pypi
sqlite 3.35.5 h74cdb3f_0 conda-forge
tabulate 0.8.9 pypi_0 pypi
tk 8.6.10 h21135ba_1 conda-forge
torch 1.8.1 pypi_0 pypi
tqdm 4.61.0 pypi_0 pypi
typing-extensions 3.10.0.0 pypi_0 pypi
tzdata 2021a he74cb21_0 conda-forge
uritemplate 3.0.1 pypi_0 pypi
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h516909a_1010 conda-forge
I also apt-installed phonemizer dependencies:
sudo apt-get install festival espeak-ng mbrola
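Despite installing espeak-ng, the full output above shows "PHONEMIZER_ESPEAK_PATH=espeak not found", which looks like phonemizer cannot locate an espeak binary under that name. A hedged guess at a fix (my own sketch): point the variable at whichever espeak executable is actually installed before running the script:

```shell
# phonemizer honors PHONEMIZER_ESPEAK_PATH; set it to a real executable
# path rather than the bare name "espeak" if that is not on PATH.
export PHONEMIZER_ESPEAK_PATH="$(command -v espeak-ng || command -v espeak)"
echo "using espeak at: $PHONEMIZER_ESPEAK_PATH"
```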
And finally, here’s what I get from apt list|grep installed
apt-list.txt
Issue Analytics
- State:
- Created 2 years ago
- Comments: 59 (8 by maintainers)
Top GitHub Comments
Followed instructions at https://github.com/kpu/kenlm/blob/master/BUILDING to install dependencies for kenlm. What they don’t mention is that you need to take the resulting binaries from kenlm/build/bin/ and copy them to /usr/bin
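Rather than copying into /usr/bin, putting the kenlm build directory on PATH also makes lmplz and build_binary visible to prepare_text.sh (a sketch, assuming kenlm was cloned and built under ~/kenlm):

```shell
# After building kenlm per its BUILDING instructions, expose its binaries
# for this shell session instead of copying them system-wide.
export PATH="$HOME/kenlm/build/bin:$PATH"
command -v lmplz && command -v build_binary
```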
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!