Errors running prepare_text.sh (and other preprocessing) from wav2vec-u in fresh environment
My Question:
How can I get prepare_text.sh running correctly in a fresh Ubuntu Jupyterlab environment? What needs to be installed, what variables set, etc.?
I’ve run into various issues attempting to run the script prepare_text.sh, from https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/unsupervised/scripts/prepare_text.sh.
Right now, I’m stuck on "preprocess.py: error: unrecognized arguments: --dict-only", but I’ve run into several other errors that I’ve had to work around, detailed below.
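On the --dict-only error specifically: my suspicion (not confirmed) is that the scripts in the cloned master repo expect a newer preprocess.py than the pip-installed fairseq 0.10.0 provides, and Python is importing the pip copy rather than my clone. A quick diagnostic sketch of my own, not part of the scripts, to see which fairseq Python actually resolves:

```shell
# Print where Python resolves "fairseq" from. If this shows a site-packages
# path (the pip-installed 0.10.0) instead of the cloned repo, the scripts
# will run the old preprocess.py that lacks --dict-only; installing the
# clone with "pip install -e $FAIRSEQ_ROOT" would be one way to fix that.
python - <<'EOF'
import importlib.util
spec = importlib.util.find_spec("fairseq")
print(spec.origin if spec else "fairseq not importable")
EOF
```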
Full current output:
After getting through all the other issues I detail below, currently this is what I see when I attempt to run the script.
I cloned the https://github.com/pytorch/fairseq.git repo, and navigated to the scripts folder: https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts before running this.
(wav2vecu_pre) jovyan@user-ofmghcmafhv-jtfbeefyexclusive-0:~/work/fairseq/examples/wav2vec/unsupervised/scripts$ zsh prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
Warning : `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText` object which is very similar.
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
[--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
[--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
[--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
[--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
[--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
[--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
cut: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/dict.txt: No such file or directory
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
fatal error: PHONEMIZER_ESPEAK_PATH=espeak not found is not an executable file
one is
sed: can't read /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
paste: /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones.txt: No such file or directory
usage: preprocess.py [-h] [--no-progress-bar] [--log-interval LOG_INTERVAL] [--log-format LOG_FORMAT] [--tensorboard-logdir TENSORBOARD_LOGDIR] [--seed SEED] [--cpu]
[--tpu] [--bf16] [--memory-efficient-bf16] [--fp16] [--memory-efficient-fp16] [--fp16-no-flatten-grads] [--fp16-init-scale FP16_INIT_SCALE]
[--fp16-scale-window FP16_SCALE_WINDOW] [--fp16-scale-tolerance FP16_SCALE_TOLERANCE] [--min-loss-scale MIN_LOSS_SCALE]
[--threshold-loss-scale THRESHOLD_LOSS_SCALE] [--user-dir USER_DIR] [--empty-cache-freq EMPTY_CACHE_FREQ] [--all-gather-list-size ALL_GATHER_LIST_SIZE]
[--model-parallel-size MODEL_PARALLEL_SIZE] [--checkpoint-suffix CHECKPOINT_SUFFIX] [--checkpoint-shard-count CHECKPOINT_SHARD_COUNT]
[--quantization-config-path QUANTIZATION_CONFIG_PATH] [--profile]
[--criterion {masked_lm,nat_loss,sentence_ranking,ctc,composite_loss,cross_entropy,legacy_masked_lm_loss,sentence_prediction,adaptive_loss,label_smoothed_cross_entropy,wav2vec,label_smoothed_cross_entropy_with_alignment,vocab_parallel_cross_entropy}]
[--tokenizer {moses,nltk,space}] [--bpe {sentencepiece,bytes,characters,byte_bpe,gpt2,hf_byte_bpe,fastbpe,subword_nmt,bert}]
[--optimizer {adam,adamax,adagrad,adafactor,adadelta,lamb,sgd,nag}]
[--lr-scheduler {triangular,fixed,reduce_lr_on_plateau,cosine,polynomial_decay,tri_stage,inverse_sqrt}] [--scoring {sacrebleu,bleu,wer,chrf}]
[--task TASK] [-s SRC] [-t TARGET] [--trainpref FP] [--validpref FP] [--testpref FP] [--align-suffix FP] [--destdir DIR] [--thresholdtgt N]
[--thresholdsrc N] [--tgtdict FP] [--srcdict FP] [--nwordstgt N] [--nwordssrc N] [--alignfile ALIGN] [--dataset-impl FORMAT] [--joined-dictionary]
[--only-source] [--padding-factor N] [--workers N]
preprocess.py: error: unrecognized arguments: --dict-only
2021-06-03 16:39:42 | INFO | fairseq_cli.preprocess | Namespace(no_progress_bar=False, log_interval=100, log_format=None, tensorboard_logdir=None, seed=1, cpu=False, tpu=False, bf16=False, memory_efficient_bf16=False, fp16=False, memory_efficient_fp16=False, fp16_no_flatten_grads=False, fp16_init_scale=128, fp16_scale_window=None, fp16_scale_tolerance=0.0, min_loss_scale=0.0001, threshold_loss_scale=None, user_dir=None, empty_cache_freq=0, all_gather_list_size=16384, model_parallel_size=1, checkpoint_suffix='', checkpoint_shard_count=1, quantization_config_path=None, profile=False, criterion='cross_entropy', tokenizer=None, bpe=None, optimizer=None, lr_scheduler='fixed', scoring='bleu', task='translation', source_lang=None, target_lang=None, trainpref='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/lm.phones.filtered.txt', validpref=None, testpref=None, align_suffix=None, destdir='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones', thresholdtgt=0, thresholdsrc=0, tgtdict=None, srcdict='/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/dict.phn.txt', nwordstgt=-1, nwordssrc=-1, alignfile=None, dataset_impl='mmap', joined_dictionary=False, only_source=True, padding_factor=8, workers=70)
Traceback (most recent call last):
File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 401, in <module>
cli_main()
File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 397, in cli_main
main(args)
File "/home/jovyan/work/fairseq//fairseq_cli/preprocess.py", line 98, in main
src_dict = task.load_dictionary(args.srcdict)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/tasks/fairseq_task.py", line 54, in load_dictionary
return Dictionary.load(filename)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 214, in load
d.add_from_file(f)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 225, in add_from_file
self.add_from_file(fd)
File "/opt/conda/envs/wav2vecu_pre/lib/python3.9/site-packages/fairseq/data/dictionary.py", line 249, in add_from_file
raise RuntimeError(
RuntimeError: Duplicate word found when loading Dictionary: '<SIL>'. Duplicate words can overwrite earlier ones by adding the #fairseq:overwrite flag at the end of the corresponding row in the dictionary file. If using the Camembert model, please download an updated copy of the model file.
prepare_text.sh:49: command not found: lmplz
prepare_text.sh:50: command not found: build_binary
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/scripts/examples/speech_recognition/kaldi/kaldi_initializer.py': [Errno 2] No such file or directory
prepare_text.sh:54: command not found: lmplz
prepare_text.sh:55: command not found: build_binary
prepare_text.sh:56: command not found: lmplz
prepare_text.sh:57: command not found: build_binary
Primary config directory not found.
Check that the config directory '/home/jovyan/work/fairseq/examples/speech_recognition/kaldi/config' exists and readable
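For the "Duplicate word found when loading Dictionary: '<SIL>'" RuntimeError in the log above, the message itself suggests appending the #fairseq:overwrite flag. An alternative workaround I sketched (untested against a full run; back up the file first, and adjust the path to your output directory) is to drop repeated symbols from the generated dictionary, keeping the first occurrence of each:

```shell
# Deduplicate the phone dictionary by its first field (the symbol),
# keeping the first line seen for each symbol.
dict=/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out/phones/dict.phn.txt
cp "$dict" "$dict.bak"
awk '!seen[$1]++' "$dict.bak" > "$dict"
```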
Fixed (?) Problem: Can’t seem to run it from the same folder as the README (workaround: run from scripts folder)
First, I can’t run it from the folder the README at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data says to run it from. If you try, you get path-not-found errors for the other scripts it calls, e.g.:
zsh scripts/prepare_text.sh sw /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
sw
sw
/home/jovyan/work/WikiDumps/wiki_sw_head.txt
/home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
python: can't open file '/home/jovyan/work/fairseq/examples/wav2vec/unsupervised/normalize_and_filter_text.py': [Errno 2] No such file or directory
Fixed (?) Problem: “ValueError: lid.187.bin cannot be opened for loading!” (workaround: use lid.176.bin instead)
Solution: download a different language ID model, and edit the code to use it.
https://fasttext.cc/docs/en/language-identification.html has a different model, lid.176.bin
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
and edit this portion of normalize_and_filter_text.py:
parser.add_argument(
    "--fasttext-model",
    help="path to fasttext model",
    default="lid.176.bin",
)
Fixed (?) Problem: dependencies needed (phonemizer, fasttext, fairseq)
The script does not list its dependencies. So far I’ve determined that phonemizer and fasttext are needed, and I think fairseq too. Are there any more I’m missing?
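A quick way to check which of these are importable in the current environment (my own sketch; the package list is just my best guess so far):

```shell
# Report which suspected dependencies are missing from the active env.
for pkg in fairseq fasttext phonemizer; do
  python -c "import $pkg" 2>/dev/null || echo "missing: $pkg"
done
```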
Fixed (?) Problem: can’t find files in fairseq_cli (solution: you need to set an environment variable, FAIRSEQ_ROOT).
I set this to point to the top level of the cloned repo (I cloned it to ~/work/fairseq/); not sure if that’s right.
export FAIRSEQ_ROOT=~/work/fairseq/
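A sanity check I could have used (my own sketch): verify that the files the script references actually exist under FAIRSEQ_ROOT, which they should if it points at the repo top level:

```shell
export FAIRSEQ_ROOT=~/work/fairseq
# prepare_text.sh reaches these via $FAIRSEQ_ROOT, so both must resolve.
for f in "$FAIRSEQ_ROOT/fairseq_cli/preprocess.py" \
         "$FAIRSEQ_ROOT/examples/speech_recognition/kaldi/kaldi_initializer.py"; do
  [ -f "$f" ] && echo "ok: $f" || echo "missing: $f"
done
```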
Fixed (?) Problem: Not sure what language code to use (guessed sw).
I’ve got Swahili data. Not sure whether to use sw, or swahili, or what; I assume I should pick from https://github.com/espeak-ng/espeak-ng/blob/master/docs/languages.md
Code
Here’s the command I use to invoke the script. Other than editing the default langid model, I haven’t changed anything in the repo; it should be the same as https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised/scripts (git log shows commit c47a9b2eef0f41b0564c8daf52cb82ea97fc6548).
zsh prepare_text.sh language /home/jovyan/work/WikiDumps/wiki_sw_head.txt /home/jovyan/work/WikiDumps/wiki_sw_head_wav2vecu_prepared.out
What have you tried?
- Tried reading https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data
- Tried reading https://github.com/pytorch/fairseq/issues/3581 and https://github.com/pytorch/fairseq/issues/3586
- Googling for various keywords such as “fairseq preprocess dict-only”
What’s your environment?
I’m in a Jupyterlab in a Docker container, running Ubuntu.
OS is Ubuntu 20.04.2:
cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
pip list:
Package                Version
---------------------- -------------------
antlr4-python3-runtime 4.8
attrs 21.2.0
certifi 2021.5.30
cffi 1.14.5
clldutils 3.9.0
colorlog 5.0.1
csvw 1.11.0
Cython 0.29.23
dataclasses 0.6
editdistance 0.5.3
fairseq 0.10.0
fasttext 0.9.2
hydra-core 1.0.6
isodate 0.6.0
joblib 1.0.1
numpy 1.20.3
omegaconf 2.0.6
phonemizer 2.2.2
pip 21.1.2
portalocker 2.0.0
pybind11 2.6.2
pycparser 2.20
python-dateutil 2.8.1
PyYAML 5.4.1
regex 2021.4.4
rfc3986 1.5.0
sacrebleu 1.5.1
segments 2.2.0
setuptools 49.6.0.post20210108
six 1.16.0
tabulate 0.8.9
torch 1.8.1
tqdm 4.61.0
typing-extensions 3.10.0.0
uritemplate 3.0.1
wheel 0.36.2
conda list:
# packages in environment at /opt/conda/envs/wav2vecu_pre:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
antlr4-python3-runtime 4.8 pypi_0 pypi
attrs 21.2.0 pypi_0 pypi
ca-certificates 2021.5.30 ha878542_0 conda-forge
certifi 2021.5.30 py39hf3d152e_0 conda-forge
cffi 1.14.5 pypi_0 pypi
clldutils 3.9.0 pypi_0 pypi
colorlog 5.0.1 pypi_0 pypi
csvw 1.11.0 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
dataclasses 0.6 pypi_0 pypi
editdistance 0.5.3 pypi_0 pypi
fairseq 0.10.0 pypi_0 pypi
fasttext 0.9.2 pypi_0 pypi
hydra-core 1.0.6 pypi_0 pypi
isodate 0.6.0 pypi_0 pypi
joblib 1.0.1 pypi_0 pypi
ld_impl_linux-64 2.35.1 hea4e1c9_2 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_19 conda-forge
libgomp 9.3.0 h2828fa1_19 conda-forge
libstdcxx-ng 9.3.0 h6de172a_19 conda-forge
ncurses 6.2 h58526e2_4 conda-forge
numpy 1.20.3 pypi_0 pypi
omegaconf 2.0.6 pypi_0 pypi
openssl 1.1.1k h7f98852_0 conda-forge
phonemizer 2.2.2 pypi_0 pypi
pip 21.1.2 pyhd8ed1ab_0 conda-forge
portalocker 2.0.0 pypi_0 pypi
pybind11 2.6.2 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
python 3.9.4 hffdb5ce_0_cpython conda-forge
python-dateutil 2.8.1 pypi_0 pypi
python_abi 3.9 1_cp39 conda-forge
pyyaml 5.4.1 pypi_0 pypi
readline 8.1 h46c0cb4_0 conda-forge
regex 2021.4.4 pypi_0 pypi
rfc3986 1.5.0 pypi_0 pypi
sacrebleu 1.5.1 pypi_0 pypi
segments 2.2.0 pypi_0 pypi
setuptools 49.6.0 py39hf3d152e_3 conda-forge
six 1.16.0 pypi_0 pypi
sqlite 3.35.5 h74cdb3f_0 conda-forge
tabulate 0.8.9 pypi_0 pypi
tk 8.6.10 h21135ba_1 conda-forge
torch 1.8.1 pypi_0 pypi
tqdm 4.61.0 pypi_0 pypi
typing-extensions 3.10.0.0 pypi_0 pypi
tzdata 2021a he74cb21_0 conda-forge
uritemplate 3.0.1 pypi_0 pypi
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h516909a_1010 conda-forge
I also apt-installed phonemizer dependencies:
sudo apt-get install festival espeak-ng mbrola
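Despite installing espeak-ng, the full output above shows "PHONEMIZER_ESPEAK_PATH=espeak not found", which looks like phonemizer cannot locate an espeak binary under that name. A hedged guess at a fix (my own sketch): point the variable at whichever espeak executable is actually installed before running the script:

```shell
# phonemizer honors PHONEMIZER_ESPEAK_PATH; set it to a real executable
# path rather than the bare name "espeak" if that is not on PATH.
export PHONEMIZER_ESPEAK_PATH="$(command -v espeak-ng || command -v espeak)"
echo "using espeak at: $PHONEMIZER_ESPEAK_PATH"
```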
And finally, here’s what I get from apt list|grep installed
apt-list.txt
Issue Analytics
- State:
- Created 2 years ago
- Comments: 59 (8 by maintainers)
Top GitHub Comments
Followed instructions at https://github.com/kpu/kenlm/blob/master/BUILDING to install dependencies for kenlm. What they don’t mention is that you need to take the resulting binaries from kenlm/build/bin/ and copy them to /usr/bin
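Rather than copying into /usr/bin, putting the kenlm build directory on PATH also makes lmplz and build_binary visible to prepare_text.sh (a sketch, assuming kenlm was cloned and built under ~/kenlm):

```shell
# After building kenlm per its BUILDING instructions, expose its binaries
# for this shell session instead of copying them system-wide.
export PATH="$HOME/kenlm/build/bin:$PATH"
command -v lmplz && command -v build_binary
```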
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!