
utils/validate_data_dir.sh: text contains 1 lines with non-printable characters during processing mustc v1

See original GitHub issue

Describe the bug

Hello, when I use the default recipe for MuST-C v1, an error is raised at stage 1: utils/validate_data_dir.sh: text contains 1 lines with non-printable characters. I think this bug is related to #4126 and #4157, but I don’t know how to avoid it.

Task information:

  • Task: ST
  • Recipe: must_c
  • ESPnet1

To Reproduce

  1. cd egs/must_c/st1
  2. ./run_basectc.sh --stage 0 --stop_stage 1
#!/usr/bin/env bash
# run_basectc.sh
# Copyright 2019 Kyoto University (Hirofumi Inaguma)
#  Apache 2.0  (http://www.apache.org/licenses/LICENSE-2.0)

. ./path.sh || exit 1;
. ./cmd.sh || exit 1;

# general configuration
backend=pytorch
stage=-1        # start from -1 if you need to start from data download
stop_stage=100
ngpu=1          # number of gpus during training ("0" uses cpu, otherwise use gpu)
dec_ngpu=0      # number of gpus during decoding ("0" uses cpu, otherwise use gpu)
nj=8            # number of parallel jobs for decoding
debugmode=1
dumpdir=dump    # directory to dump full features
N=0             # number of minibatches to be used (mainly for debugging). "0" uses all minibatches.
verbose=0       # verbose option
resume=         # Resume the training from snapshot
seed=1          # seed to generate random number
# feature configuration
do_delta=false

preprocess_config=conf/specaug.yaml
train_config=conf/train.yaml
decode_config=conf/decode.yaml

# decoding parameter
trans_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'

# model average related (only for transformer)
n_average=10                 # the number of ST models to be averaged
use_valbest_average=true     # if true, the validation `n_average`-best ST models will be averaged.
                             # if false, the last `n_average` ST models will be averaged.
metric=bleu                  # loss/acc/bleu

# pre-training related
asr_model=
mt_model=

# preprocessing related
src_case=lc.rm
tgt_case=tc
# tc: truecase
# lc: lowercase
# lc.rm: lowercase with punctuation removal

# postprocessing related
remove_nonverbal=true  # remove non-verbal labels such as "( Applaus )"
# NOTE: IWSLT community accepts this setting and therefore we use this by default

# Set this to somewhere where you want to put your data, or where
# someone else has already put it.
must_c=mustc_v1.0

# target language related
tgt_lang=de
# you can choose from de, es, fr, it, nl, pt, ro, ru

# bpemode (unigram or bpe)
nbpe=10000
bpemode=unigram

# exp tag
tag="" # tag for managing experiments.

. utils/parse_options.sh || exit 1;

# Set bash to 'debug' mode; it will exit on:
# -e 'error', -u 'undefined variable', -o pipefail 'error in a pipeline', -x 'print commands',
set -e
set -u
set -o pipefail

train_set=train_sp.en-${tgt_lang}.${tgt_lang}
train_dev=dev.en-${tgt_lang}.${tgt_lang}
trans_set="dev_org.en-${tgt_lang}.${tgt_lang} tst-COMMON.en-${tgt_lang}.${tgt_lang} tst-HE.en-${tgt_lang}.${tgt_lang}"

if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
    echo "stage -1: Data Download"
    local/download_and_untar.sh ${must_c} ${tgt_lang} "v1"
fi

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    ### Task dependent. You have to make the following data preparation part by yourself.
    ### But you can utilize Kaldi recipes in most cases
    echo "stage 0: Data Preparation"
    local/data_prep.sh ${must_c} ${tgt_lang} "v1"
fi

feat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}
feat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    ### Task dependent. You have to design training and dev sets by yourself.
    ### But you can utilize Kaldi recipes in most cases
    echo "stage 1: Feature Generation"
    fbankdir=fbank
    # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame
    for lang in $(echo ${tgt_lang} | tr '_' ' '); do
        for x in dev.en-${tgt_lang} tst-COMMON.en-${tgt_lang} tst-HE.en-${tgt_lang}; do
            steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 32 --write_utt2num_frames true \
                data/${x} exp/make_fbank/${x} ${fbankdir}
        done
    done

    # speed perturbation
    speed_perturb.sh --cmd "$train_cmd" --speeds "1.0" --cases "lc.rm lc tc" --langs "en ${tgt_lang}" data/train.en-${tgt_lang} data/train_sp.en-${tgt_lang} ${fbankdir}

    # Divide into source and target languages
    for x in train_sp.en-${tgt_lang} dev.en-${tgt_lang} tst-COMMON.en-${tgt_lang} tst-HE.en-${tgt_lang}; do
        divide_lang.sh ${x} "en ${tgt_lang}"
    done
    for lang in ${tgt_lang} en; do
        cp -rf data/dev.en-${tgt_lang}.${lang} data/dev_org.en-${tgt_lang}.${lang}
    done

    # remove long and short utterances
    for x in train_sp.en-${tgt_lang} dev.en-${tgt_lang}; do
        clean_corpus.sh --maxframes 3000 --maxchars 400 --utt_extra_files "text.tc text.lc text.lc.rm" data/${x} "en ${tgt_lang}"
    done

    # compute global CMVN
    compute-cmvn-stats scp:data/${train_set}/feats.scp data/${train_set}/cmvn.ark

    # dump features for training
    dump.sh --cmd "$train_cmd" --nj 80 --do_delta $do_delta \
        data/${train_set}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/${train_set} ${feat_tr_dir}
    for x in ${train_dev} ${trans_set}; do
        feat_trans_dir=${dumpdir}/${x}/delta${do_delta}; mkdir -p ${feat_trans_dir}
        dump.sh --cmd "$train_cmd" --nj 32 --do_delta $do_delta \
            data/${x}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/trans/${x} ${feat_trans_dir}
    done
fi

dict=data/lang_1spm/${train_set}_${bpemode}${nbpe}_units_${tgt_case}.txt
nlsyms=data/lang_1spm/${train_set}_non_lang_syms_${tgt_case}.txt
bpemodel=data/lang_1spm/${train_set}_${bpemode}${nbpe}_${tgt_case}
echo "dictionary: ${dict}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    ### Task dependent. You have to check non-linguistic symbols used in the corpus.
    echo "stage 2: Dictionary and Json Data Preparation"
    mkdir -p data/lang_1spm/

    echo "make a non-linguistic symbol list for all languages"
    grep sp1.0 data/train_sp.en-${tgt_lang}.*/text.${tgt_case} | cut -f 2- -d' ' | grep -o -P '&[^;]*;'| sort | uniq > ${nlsyms}
    cat ${nlsyms}

    echo "make a joint source and target dictionary"
    echo "<unk> 1" > ${dict} # <unk> must be 1, 0 will be used for "blank" in CTC
    offset=$(wc -l < ${dict})
    grep sp1.0 data/train_sp.en-${tgt_lang}.*/text.${tgt_case} | cut -f 2- -d' ' | grep -v -e '^\s*$' > data/lang_1spm/input_${tgt_lang}_${src_case}_${tgt_case}.txt
    spm_train --user_defined_symbols="$(tr "\n" "," < ${nlsyms})" --input=data/lang_1spm/input_${tgt_lang}_${src_case}_${tgt_case}.txt \
        --vocab_size=${nbpe} --model_type=${bpemode} --model_prefix=${bpemodel} --input_sentence_size=100000000 --character_coverage=1.0
    spm_encode --model=${bpemodel}.model --output_format=piece < data/lang_1spm/input_${tgt_lang}_${src_case}_${tgt_case}.txt \
        | tr ' ' '\n' | sort | uniq | awk -v offset=${offset} '{print $0 " " NR+offset}' >> ${dict}
    wc -l ${dict}

    echo "make json files"
    data2json.sh --nj 16 --feat ${feat_tr_dir}/feats.scp --text data/${train_set}/text.${tgt_case} --bpecode ${bpemodel}.model --lang "${tgt_lang}" \
        data/${train_set} ${dict} > ${feat_tr_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json
    for x in ${train_dev} ${trans_set}; do
        feat_trans_dir=${dumpdir}/${x}/delta${do_delta}
        data2json.sh --feat ${feat_trans_dir}/feats.scp --text data/${x}/text.${tgt_case} --bpecode ${bpemodel}.model --lang "${tgt_lang}" \
            data/${x} ${dict} > ${feat_trans_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json
    done

    # update json (add source references)
    for x in ${train_set} ${train_dev} ${trans_set}; do
        feat_dir=${dumpdir}/${x}/delta${do_delta}
        data_dir=data/$(echo ${x} | cut -f 1 -d ".").en-${tgt_lang}.en
        update_json.sh --text ${data_dir}/text.${src_case} --bpecode ${bpemodel}.model \
            ${feat_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json ${data_dir} ${dict}
    done
fi

# NOTE: skip stage 3: LM Preparation

if [ -z ${tag} ]; then
    expname=${train_set}_${tgt_case}_${backend}_$(basename ${train_config%.*})_${bpemode}${nbpe}
    if ${do_delta}; then
        expname=${expname}_delta
    fi
    if [ -n "${preprocess_config}" ]; then
        expname=${expname}_$(basename ${preprocess_config%.*})
    fi
    if [ -n "${asr_model}" ]; then
        expname=${expname}_asrtrans
    fi
    if [ -n "${mt_model}" ]; then
        expname=${expname}_mttrans
    fi
else
    expname=${train_set}_${tgt_case}_${backend}_${tag}
fi
expdir=exp/${expname}
mkdir -p ${expdir}

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "stage 4: Network Training"

    ${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log \
        st_train.py \
        --config ${train_config} \
        --preprocess-conf ${preprocess_config} \
        --ngpu ${ngpu} \
        --backend ${backend} \
        --outdir ${expdir}/results \
        --tensorboard-dir tensorboard/${expname} \
        --debugmode ${debugmode} \
        --dict ${dict} \
        --debugdir ${expdir} \
        --minibatches ${N} \
        --seed ${seed} \
        --verbose ${verbose} \
        --resume ${resume} \
        --train-json ${feat_tr_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json \
        --valid-json ${feat_dt_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json \
        --enc-init ${asr_model} \
        --dec-init ${mt_model}
fi

if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    echo "stage 5: Decoding"
    if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \
       [[ $(get_yaml.py ${train_config} model-module) = *conformer* ]]; then
        # Average ST models
        if ${use_valbest_average}; then
            trans_model=model.val${n_average}.avg.best
            opt="--log ${expdir}/results/log --metric ${metric}"
        else
            trans_model=model.last${n_average}.avg.best
            opt="--log"
        fi
        average_checkpoints.py \
            ${opt} \
            --backend ${backend} \
            --snapshots ${expdir}/results/snapshot.ep.* \
            --out ${expdir}/results/${trans_model} \
            --num ${n_average}
    fi

    if [ ${dec_ngpu} = 1 ]; then
        nj=1
    fi

    pids=() # initialize pids
    for x in ${trans_set}; do
    (
        decode_dir=decode_${x}_$(basename ${decode_config%.*})
        feat_trans_dir=${dumpdir}/${x}/delta${do_delta}

        # reset log for RTF calculation
        if [ -f ${expdir}/${decode_dir}/log/decode.1.log ]; then
            rm ${expdir}/${decode_dir}/log/decode.*.log
        fi

        # split data
        splitjson.py --parts ${nj} ${feat_trans_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json

        ${decode_cmd} JOB=1:${nj} ${expdir}/${decode_dir}/log/decode.JOB.log \
            st_trans.py \
            --config ${decode_config} \
            --ngpu ${dec_ngpu} \
            --backend ${backend} \
            --batchsize 0 \
            --trans-json ${feat_trans_dir}/split${nj}utt/data_${bpemode}${nbpe}.JOB.json \
            --result-label ${expdir}/${decode_dir}/data.JOB.json \
            --model ${expdir}/results/${trans_model}

        score_bleu.sh --case ${tgt_case} --bpemodel ${bpemodel}.model \
            --remove_nonverbal ${remove_nonverbal} \
            ${expdir}/${decode_dir} ${tgt_lang} ${dict}

        calculate_rtf.py --log-dir ${expdir}/${decode_dir}/log
    ) &
    pids+=($!) # store background pids
    done
    i=0; for pid in "${pids[@]}"; do wait ${pid} || ((++i)); done
    [ ${i} -gt 0 ] && echo "$0: ${i} background jobs are failed." && false
    echo "Finished"
fi

Error logs

(espnet) [.xxx@localhost st1]$ ./run_basectc.sh --stage 0 --stop_stage 0
stage 0: Data Preparation
remove duplicate lines…
Reduced #utt from 229699 to 229696
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/train
remove duplicate lines…
Reduced #utt from 1423 to 1423
fix_data_dir.sh: kept all 1423 utterances.
fix_data_dir.sh: old files are kept in data/dev.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/dev
remove duplicate lines…
Reduced #utt from 2641 to 2641
fix_data_dir.sh: kept all 2641 utterances.
fix_data_dir.sh: old files are kept in data/tst-COMMON.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/tst-COMMON
remove duplicate lines…
Reduced #utt from 600 to 600
fix_data_dir.sh: kept all 600 utterances.
fix_data_dir.sh: old files are kept in data/tst-HE.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/tst-HE
dictionary: data/lang_1spm/train_sp.en-de.de_unigram10000_units_tc.txt

(espnet) [xxx@localhost st1]$ ./run_basectc.sh --stage 1 --stop_stage 1
stage 1: Feature Generation
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/dev.en-de exp/make_fbank/dev.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/dev.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for dev.en-de
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/tst-COMMON.en-de exp/make_fbank/tst-COMMON.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/tst-COMMON.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for tst-COMMON.en-de
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/tst-HE.en-de exp/make_fbank/tst-HE.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/tst-HE.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for tst-HE.en-de
/mnt/hanyuchen/espnet/espnet/egs/must_c/st1/../../../utils/speed_perturb.sh --cmd run.pl --speeds 1.0 --cases lc.rm lc tc --langs en de data/train.en-de data/train_sp.en-de fbank
utils/data/get_utt2dur.sh: working out data/train.en-de/utt2dur from data/train.en-de/segments
utils/data/get_utt2dur.sh: computed data/train.en-de/utt2dur
utils/data/get_reco2dur.sh: obtaining durations from recordings
utils/data/get_reco2dur.sh: could not get recording lengths from sphere-file headers, using wav-to-duration
utils/data/get_reco2dur.sh: computed data/train.en-de/reco2dur
utils/perturb_data_dir_speed.sh: generated speed-perturbed version of data in data/train.en-de, in data/train.en-de/tmp-hTyC9/temp.1.0
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train.en-de/tmp-hTyC9/temp.1.0/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train.en-de/tmp-hTyC9/temp.1.0
utils/combine_data.sh --extra-files utt2uniq data/train_sp.en-de data/train.en-de/tmp-hTyC9/temp.1.0
utils/combine_data.sh: combined utt2uniq
utils/combine_data.sh: combined segments
utils/combine_data.sh: combined utt2spk
utils/combine_data.sh [info]: not combining utt2lang as it does not exist
utils/combine_data.sh: combined utt2dur
utils/combine_data.sh [info]: not combining utt2num_frames as it does not exist
utils/combine_data.sh: combined reco2dur
utils/combine_data.sh [info]: not combining feats.scp as it does not exist
utils/combine_data.sh [info]: not combining text as it does not exist
utils/combine_data.sh [info]: not combining cmvn.scp as it does not exist
utils/combine_data.sh [info]: not combining vad.scp as it does not exist
utils/combine_data.sh [info]: not combining reco2file_and_channel as it does not exist
utils/combine_data.sh: combined wav.scp
utils/combine_data.sh [info]: not combining spk2gender as it does not exist
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train_sp.en-de/.backup
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/train_sp.en-de exp/make_fbank/train_sp.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for train_sp.en-de
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train_sp.en-de/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp.en-de
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train_sp.en-de.en/.backup
utils/validate_data_dir.sh: text contains 1 lines with non-printable characters

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
kan-bayashi commented, Feb 10, 2022

The --non_print option was introduced in Kaldi recently. You can edit utils/validate_data_dir.sh to set non_print=true. https://github.com/kaldi-asr/kaldi/blob/3ec108da76e3d9dba901fb69f046d0e46170b8e7/egs/wsj/s5/utils/validate_data_dir.sh#L9
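
Two ways to act on this are sketched below. Both assume a recent enough Kaldi checkout where utils/validate_data_dir.sh defines non_print near the top (the line linked above); the command-line spelling of the flag is an assumption, so check the option parsing in your copy before relying on it:

# (a) What the comment above suggests: flip the default near the top of
#     utils/validate_data_dir.sh. In ESPnet recipes utils/ is usually a symlink
#     into your Kaldi checkout, so this edits the Kaldi copy.
sed -i 's/^non_print=false$/non_print=true/' utils/validate_data_dir.sh

# (b) Alternatively, if you invoke the validator directly yourself, recent Kaldi
#     versions also accept a flag for this (spelling assumed here):
utils/validate_data_dir.sh --non-print data/train_sp.en-de.en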

0 reactions
hannlp commented, Feb 21, 2022

> The --non_print option was introduced in Kaldi recently. You can edit utils/validate_data_dir.sh to set non_print=true. https://github.com/kaldi-asr/kaldi/blob/3ec108da76e3d9dba901fb69f046d0e46170b8e7/egs/wsj/s5/utils/validate_data_dir.sh#L9

It works for me, thank you!
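
A follow-up note (not from the thread): once the validator tolerates the non-printable character, the failing stage can simply be re-run with the same options as in the reproduction above, e.g.:

./run_basectc.sh --stage 1 --stop_stage 1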

Read more comments on GitHub
