utils/validate_data_dir.sh: text contains 1 lines with non-printable characters while processing MuST-C v1
Describe the bug
Hello, when I use the default recipe for MuST-C v1, a bug is raised (at stage 1): utils/validate_data_dir.sh: text contains 1 lines with non-printable characters. I think this bug is related to #4126 and #4157, but I don't know how to avoid it.
Task information:
- Task: ST
- Recipe: must_c
- ESPnet1
To Reproduce
- cd egs/must_c/st1
- ./run_basectc.sh --stage 0 --stop_stage 1
#!/usr/bin/env bash
# run_basectc.sh
# Copyright 2019 Kyoto University (Hirofumi Inaguma)
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
. ./path.sh || exit 1;
. ./cmd.sh || exit 1;
# general configuration
backend=pytorch
stage=-1 # start from -1 if you need to start from data download
stop_stage=100
ngpu=1 # number of gpus during training ("0" uses cpu, otherwise use gpu)
dec_ngpu=0 # number of gpus during decoding ("0" uses cpu, otherwise use gpu)
nj=8 # number of parallel jobs for decoding
debugmode=1
dumpdir=dump # directory to dump full features
N=0 # number of minibatches to be used (mainly for debugging). "0" uses all minibatches.
verbose=0 # verbose option
resume= # Resume the training from snapshot
seed=1 # seed to generate random number
# feature configuration
do_delta=false
preprocess_config=conf/specaug.yaml
train_config=conf/train.yaml
decode_config=conf/decode.yaml
# decoding parameter
trans_model=model.acc.best # set a model to be used for decoding: 'model.acc.best' or 'model.loss.best'
# model average related (only for transformer)
n_average=10 # the number of ST models to be averaged
use_valbest_average=true # if true, the validation `n_average`-best ST models will be averaged.
# if false, the last `n_average` ST models will be averaged.
metric=bleu # loss/acc/bleu
# pre-training related
asr_model=
mt_model=
# preprocessing related
src_case=lc.rm
tgt_case=tc
# tc: truecase
# lc: lowercase
# lc.rm: lowercase with punctuation removal
# postprocessing related
remove_nonverbal=true # remove non-verbal labels such as "( Applaus )"
# NOTE: IWSLT community accepts this setting and therefore we use this by default
# Set this to somewhere where you want to put your data, or where
# someone else has already put it.
must_c=mustc_v1.0
# target language related
tgt_lang=de
# you can choose from de, es, fr, it, nl, pt, ro, ru
# bpemode (unigram or bpe)
nbpe=10000
bpemode=unigram
# exp tag
tag="" # tag for managing experiments.
. utils/parse_options.sh || exit 1;
# Set bash to 'strict' mode; it will exit on:
# -e 'error', -u 'undefined variable', -o pipefail 'error in pipeline'
set -e
set -u
set -o pipefail
train_set=train_sp.en-${tgt_lang}.${tgt_lang}
train_dev=dev.en-${tgt_lang}.${tgt_lang}
trans_set="dev_org.en-${tgt_lang}.${tgt_lang} tst-COMMON.en-${tgt_lang}.${tgt_lang} tst-HE.en-${tgt_lang}.${tgt_lang}"
if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then
echo "stage -1: Data Download"
local/download_and_untar.sh ${must_c} ${tgt_lang} "v1"
fi
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
### Task dependent. You have to do the data preparation part by yourself.
### But you can utilize Kaldi recipes in most cases
echo "stage 0: Data Preparation"
local/data_prep.sh ${must_c} ${tgt_lang} "v1"
fi
feat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir}
feat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
### Task dependent. You have to design training and dev sets by yourself.
### But you can utilize Kaldi recipes in most cases
echo "stage 1: Feature Generation"
fbankdir=fbank
# Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame
for lang in $(echo ${tgt_lang} | tr '_' ' '); do
for x in dev.en-${tgt_lang} tst-COMMON.en-${tgt_lang} tst-HE.en-${tgt_lang}; do
steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 32 --write_utt2num_frames true \
data/${x} exp/make_fbank/${x} ${fbankdir}
done
done
# speed perturbation
speed_perturb.sh --cmd "$train_cmd" --speeds "1.0" --cases "lc.rm lc tc" --langs "en ${tgt_lang}" data/train.en-${tgt_lang} data/train_sp.en-${tgt_lang} ${fbankdir}
# Divide into source and target languages
for x in train_sp.en-${tgt_lang} dev.en-${tgt_lang} tst-COMMON.en-${tgt_lang} tst-HE.en-${tgt_lang}; do
divide_lang.sh ${x} "en ${tgt_lang}"
done
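# NOTE: the validation failure reported in this issue is raised on
# data/train_sp.en-de.en, which divide_lang.sh creates here; the combined
# directories from the earlier steps validate cleanly (see the error log below).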
for lang in ${tgt_lang} en; do
cp -rf data/dev.en-${tgt_lang}.${lang} data/dev_org.en-${tgt_lang}.${lang}
done
# remove long and short utterances
for x in train_sp.en-${tgt_lang} dev.en-${tgt_lang}; do
clean_corpus.sh --maxframes 3000 --maxchars 400 --utt_extra_files "text.tc text.lc text.lc.rm" data/${x} "en ${tgt_lang}"
done
# compute global CMVN
compute-cmvn-stats scp:data/${train_set}/feats.scp data/${train_set}/cmvn.ark
# dump features for training
dump.sh --cmd "$train_cmd" --nj 80 --do_delta $do_delta \
data/${train_set}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/${train_set} ${feat_tr_dir}
for x in ${train_dev} ${trans_set}; do
feat_trans_dir=${dumpdir}/${x}/delta${do_delta}; mkdir -p ${feat_trans_dir}
dump.sh --cmd "$train_cmd" --nj 32 --do_delta $do_delta \
data/${x}/feats.scp data/${train_set}/cmvn.ark exp/dump_feats/trans/${x} ${feat_trans_dir}
done
fi
dict=data/lang_1spm/${train_set}_${bpemode}${nbpe}_units_${tgt_case}.txt
nlsyms=data/lang_1spm/${train_set}_non_lang_syms_${tgt_case}.txt
bpemodel=data/lang_1spm/${train_set}_${bpemode}${nbpe}_${tgt_case}
echo "dictionary: ${dict}"
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
### Task dependent. You have to check non-linguistic symbols used in the corpus.
echo "stage 2: Dictionary and Json Data Preparation"
mkdir -p data/lang_1spm/
echo "make a non-linguistic symbol list for all languages"
grep sp1.0 data/train_sp.en-${tgt_lang}.*/text.${tgt_case} | cut -f 2- -d' ' | grep -o -P '&[^;]*;'| sort | uniq > ${nlsyms}
cat ${nlsyms}
echo "make a joint source and target dictionary"
echo "<unk> 1" > ${dict} # <unk> must be 1, 0 will be used for "blank" in CTC
offset=$(wc -l < ${dict})
grep sp1.0 data/train_sp.en-${tgt_lang}.*/text.${tgt_case} | cut -f 2- -d' ' | grep -v -e '^\s*$' > data/lang_1spm/input_${tgt_lang}_${src_case}_${tgt_case}.txt
spm_train --user_defined_symbols="$(tr "\n" "," < ${nlsyms})" --input=data/lang_1spm/input_${tgt_lang}_${src_case}_${tgt_case}.txt \
--vocab_size=${nbpe} --model_type=${bpemode} --model_prefix=${bpemodel} --input_sentence_size=100000000 --character_coverage=1.0
spm_encode --model=${bpemodel}.model --output_format=piece < data/lang_1spm/input_${tgt_lang}_${src_case}_${tgt_case}.txt \
| tr ' ' '\n' | sort | uniq | awk -v offset=${offset} '{print $0 " " NR+offset}' >> ${dict}
wc -l ${dict}
echo "make json files"
data2json.sh --nj 16 --feat ${feat_tr_dir}/feats.scp --text data/${train_set}/text.${tgt_case} --bpecode ${bpemodel}.model --lang "${tgt_lang}" \
data/${train_set} ${dict} > ${feat_tr_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json
for x in ${train_dev} ${trans_set}; do
feat_trans_dir=${dumpdir}/${x}/delta${do_delta}
data2json.sh --feat ${feat_trans_dir}/feats.scp --text data/${x}/text.${tgt_case} --bpecode ${bpemodel}.model --lang "${tgt_lang}" \
data/${x} ${dict} > ${feat_trans_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json
done
# update json (add source references)
for x in ${train_set} ${train_dev} ${trans_set}; do
feat_dir=${dumpdir}/${x}/delta${do_delta}
data_dir=data/$(echo ${x} | cut -f 1 -d ".").en-${tgt_lang}.en
update_json.sh --text ${data_dir}/text.${src_case} --bpecode ${bpemodel}.model \
${feat_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json ${data_dir} ${dict}
done
fi
# NOTE: skip stage 3: LM Preparation
if [ -z ${tag} ]; then
expname=${train_set}_${tgt_case}_${backend}_$(basename ${train_config%.*})_${bpemode}${nbpe}
if ${do_delta}; then
expname=${expname}_delta
fi
if [ -n "${preprocess_config}" ]; then
expname=${expname}_$(basename ${preprocess_config%.*})
fi
if [ -n "${asr_model}" ]; then
expname=${expname}_asrtrans
fi
if [ -n "${mt_model}" ]; then
expname=${expname}_mttrans
fi
else
expname=${train_set}_${tgt_case}_${backend}_${tag}
fi
expdir=exp/${expname}
mkdir -p ${expdir}
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "stage 4: Network Training"
${cuda_cmd} --gpu ${ngpu} ${expdir}/train.log \
st_train.py \
--config ${train_config} \
--preprocess-conf ${preprocess_config} \
--ngpu ${ngpu} \
--backend ${backend} \
--outdir ${expdir}/results \
--tensorboard-dir tensorboard/${expname} \
--debugmode ${debugmode} \
--dict ${dict} \
--debugdir ${expdir} \
--minibatches ${N} \
--seed ${seed} \
--verbose ${verbose} \
--resume ${resume} \
--train-json ${feat_tr_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json \
--valid-json ${feat_dt_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json \
--enc-init ${asr_model} \
--dec-init ${mt_model}
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
echo "stage 5: Decoding"
if [[ $(get_yaml.py ${train_config} model-module) = *transformer* ]] || \
[[ $(get_yaml.py ${train_config} model-module) = *conformer* ]]; then
# Average ST models
if ${use_valbest_average}; then
trans_model=model.val${n_average}.avg.best
opt="--log ${expdir}/results/log --metric ${metric}"
else
trans_model=model.last${n_average}.avg.best
opt="--log"
fi
average_checkpoints.py \
${opt} \
--backend ${backend} \
--snapshots ${expdir}/results/snapshot.ep.* \
--out ${expdir}/results/${trans_model} \
--num ${n_average}
fi
if [ ${dec_ngpu} = 1 ]; then
nj=1
fi
pids=() # initialize pids
for x in ${trans_set}; do
(
decode_dir=decode_${x}_$(basename ${decode_config%.*})
feat_trans_dir=${dumpdir}/${x}/delta${do_delta}
# reset log for RTF calculation
if [ -f ${expdir}/${decode_dir}/log/decode.1.log ]; then
rm ${expdir}/${decode_dir}/log/decode.*.log
fi
# split data
splitjson.py --parts ${nj} ${feat_trans_dir}/data_${bpemode}${nbpe}.${src_case}_${tgt_case}.json
${decode_cmd} JOB=1:${nj} ${expdir}/${decode_dir}/log/decode.JOB.log \
st_trans.py \
--config ${decode_config} \
--ngpu ${dec_ngpu} \
--backend ${backend} \
--batchsize 0 \
--trans-json ${feat_trans_dir}/split${nj}utt/data_${bpemode}${nbpe}.JOB.json \
--result-label ${expdir}/${decode_dir}/data.JOB.json \
--model ${expdir}/results/${trans_model}
score_bleu.sh --case ${tgt_case} --bpemodel ${bpemodel}.model \
--remove_nonverbal ${remove_nonverbal} \
${expdir}/${decode_dir} ${tgt_lang} ${dict}
calculate_rtf.py --log-dir ${expdir}/${decode_dir}/log
) &
pids+=($!) # store background pids
done
i=0; for pid in "${pids[@]}"; do wait ${pid} || ((++i)); done
[ ${i} -gt 0 ] && echo "$0: ${i} background jobs failed." && false
echo "Finished"
fi
Error logs

(espnet) [xxx@localhost st1]$ ./run_basectc.sh --stage 0 --stop_stage 0
stage 0: Data Preparation
remove duplicate lines…
Reduced #utt from 229699 to 229696
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/train
remove duplicate lines…
Reduced #utt from 1423 to 1423
fix_data_dir.sh: kept all 1423 utterances.
fix_data_dir.sh: old files are kept in data/dev.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/dev
remove duplicate lines…
Reduced #utt from 2641 to 2641
fix_data_dir.sh: kept all 2641 utterances.
fix_data_dir.sh: old files are kept in data/tst-COMMON.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/tst-COMMON
remove duplicate lines…
Reduced #utt from 600 to 600
fix_data_dir.sh: kept all 600 utterances.
fix_data_dir.sh: old files are kept in data/tst-HE.en-de/.backup
local/data_prep.sh: successfully prepared data in data/local/en-de/tst-HE
dictionary: data/lang_1spm/train_sp.en-de.de_unigram10000_units_tc.txt

(espnet) [xxx@localhost st1]$ ./run_basectc.sh --stage 1 --stop_stage 1
stage 1: Feature Generation
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/dev.en-de exp/make_fbank/dev.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/dev.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for dev.en-de
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/tst-COMMON.en-de exp/make_fbank/tst-COMMON.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/tst-COMMON.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for tst-COMMON.en-de
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/tst-HE.en-de exp/make_fbank/tst-HE.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/tst-HE.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for tst-HE.en-de
/mnt/hanyuchen/espnet/espnet/egs/must_c/st1/…/…/…/utils/speed_perturb.sh --cmd run.pl --speeds 1.0 --cases lc.rm lc tc --langs en de data/train.en-de data/train_sp.en-de fbank
utils/data/get_utt2dur.sh: working out data/train.en-de/utt2dur from data/train.en-de/segments
utils/data/get_utt2dur.sh: computed data/train.en-de/utt2dur
utils/data/get_reco2dur.sh: obtaining durations from recordings
utils/data/get_reco2dur.sh: could not get recording lengths from sphere-file headers, using wav-to-duration
utils/data/get_reco2dur.sh: computed data/train.en-de/reco2dur
utils/perturb_data_dir_speed.sh: generated speed-perturbed version of data in data/train.en-de, in data/train.en-de/tmp-hTyC9/temp.1.0
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train.en-de/tmp-hTyC9/temp.1.0/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train.en-de/tmp-hTyC9/temp.1.0
utils/combine_data.sh --extra-files utt2uniq data/train_sp.en-de data/train.en-de/tmp-hTyC9/temp.1.0
utils/combine_data.sh: combined utt2uniq
utils/combine_data.sh: combined segments
utils/combine_data.sh: combined utt2spk
utils/combine_data.sh [info]: not combining utt2lang as it does not exist
utils/combine_data.sh: combined utt2dur
utils/combine_data.sh [info]: not combining utt2num_frames as it does not exist
utils/combine_data.sh: combined reco2dur
utils/combine_data.sh [info]: not combining feats.scp as it does not exist
utils/combine_data.sh [info]: not combining text as it does not exist
utils/combine_data.sh [info]: not combining cmvn.scp as it does not exist
utils/combine_data.sh [info]: not combining vad.scp as it does not exist
utils/combine_data.sh [info]: not combining reco2file_and_channel as it does not exist
utils/combine_data.sh: combined wav.scp
utils/combine_data.sh [info]: not combining spk2gender as it does not exist
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train_sp.en-de/.backup
steps/make_fbank_pitch.sh --cmd run.pl --nj 32 --write_utt2num_frames true data/train_sp.en-de exp/make_fbank/train_sp.en-de fbank
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp.en-de
steps/make_fbank_pitch.sh [info]: segments file exists: using that.
steps/make_fbank_pitch.sh: Succeeded creating filterbank and pitch features for train_sp.en-de
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train_sp.en-de/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train_sp.en-de
fix_data_dir.sh: kept all 229696 utterances.
fix_data_dir.sh: old files are kept in data/train_sp.en-de.en/.backup
utils/validate_data_dir.sh: text contains 1 lines with non-printable characters
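Before loosening validation, it helps to find the offending line. A hedged diagnostic sketch: the character class below is a generic guess at what counts as non-printable, not the exact test validate_data_dir.sh performs, and the file path is taken from the log above.

# List lines of the failing text file that contain ASCII control
# characters, a common trigger for this error:
grep -n -P '[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]' data/train_sp.en-de.en/text

# To repair the data instead of relaxing validation, strip those
# characters (back up first; -CSD keeps UTF-8 text intact):
cp data/train_sp.en-de.en/text data/train_sp.en-de.en/text.bak
perl -CSD -pe 's/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]//g' \
  data/train_sp.en-de.en/text.bak > data/train_sp.en-de.en/text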
Top GitHub Comments
The non_print option was introduced in Kaldi recently. You can edit
utils/validate_data_dir.sh
to set non_print=true:
https://github.com/kaldi-asr/kaldi/blob/3ec108da76e3d9dba901fb69f046d0e46170b8e7/egs/wsj/s5/utils/validate_data_dir.sh#L9

It works for me, thank you!
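For anyone applying this fix, a minimal sketch of the two routes, assuming a Kaldi checkout recent enough to include non_print; the flag spelling (--non-print) and the sed pattern below are inferred from the linked file and may need adjusting for your version.

# Route 1: flip the default in utils/validate_data_dir.sh (simplest here,
# since the recipe invokes the validator indirectly through other utils):
sed -i 's/^non_print=false/non_print=true/' utils/validate_data_dir.sh

# Route 2: where you control the call site, pass the flag explicitly:
utils/validate_data_dir.sh --non-print data/train_sp.en-de.en

Note that non_print=true merely relaxes the check; if the stray character is genuinely garbage, cleaning the text itself (see the sketch after the error log above) is the safer fix.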