Czech data preparation for ASR: ffmpeg with a pipe causes get_utt2dur.sh to crash
The data preparation for the Czech language uses ffmpeg with a pipe in wav.scp, and ffmpeg cannot write the audio length into the WAV header when its output goes to a pipe.
The wav.scp command is created here: https://github.com/espnet/espnet/blob/a5742a3b23d8a27c0c0ef02d105e1ab9d6321e08/egs/commonvoice/asr1/local/data_prep.pl#L57
The underlying ffmpeg issue is described at: https://trac.ffmpeg.org/ticket/7892
The error in the output of run.sh (script level):
# script level for run.sh
utils/data/get_utt2dur.sh: could not get utterance lengths from sphere-file headers, using wav-to-duration
run.pl: 4 / 4 failed, log is in data/train_cs/log/get_durations.*.log
The following demonstrates the error with and without the pipe in FFmpeg. Compare the duration info!
test_bad.wav was redirected to a file (the same behavior as with a pipe).
test_ok.wav was saved to disk by FFmpeg directly. All other FFmpeg parameters are the same.
oplatek@hydra4:master-replicate-czech:asr1$ ffmpeg -i download/cs_data/cv-corpus-5.1-2020-06-22/cs/clips/common_voice_cs_20500128.mp3 -ar 16000 -acodec pcm_s16le -ac 1 -f wav - > test_bad.wav 2> /dev/null
oplatek@hydra4:master-replicate-czech:asr1$ ffmpeg -i download/cs_data/cv-corpus-5.1-2020-06-22/cs/clips/common_voice_cs_20500128.mp3 -ar 16000 -acodec pcm_s16le -ac 1 -f wav test_ok.wav 2> /dev/null
oplatek@hydra4:master-replicate-czech:asr1$ soxi test_bad.wav test_ok.wav
Input File : 'test_bad.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 37:16:57.73 = 2147483647 samples ~ 1.00663e+07 CDDA sectors
File Size : 96.8k
Bit Rate : 5.77
Sample Encoding: 16-bit Signed Integer PCM
Input File : 'test_ok.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:03.02 = 48384 samples ~ 226.8 CDDA sectors
File Size : 96.8k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
Total Duration of 2 files: 37:17:00.75
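Both files have the same size but very different reported durations, which suggests that the length declared in the header, not the audio data, is what differs. As a minimal sketch (the function name `inspect_wav` is hypothetical and this is not part of ESPnet or Kaldi), the following Python reads a WAV file's chunks and returns the data size the header claims next to the number of data bytes actually present. It assumes uncompressed PCM and that the `data` chunk is the last chunk in the file.

```python
import struct


def inspect_wav(raw: bytes):
    """Return (claimed_data_bytes, actual_data_bytes, byte_rate) for a PCM WAV blob.

    Assumption: the data chunk is the last chunk, so everything after its
    8-byte chunk header is audio.
    """
    if raw[:4] != b"RIFF" or raw[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    byte_rate = None
    while pos + 8 <= len(raw):
        chunk_id, size = struct.unpack_from("<4sI", raw, pos)
        body = pos + 8
        if chunk_id == b"fmt ":
            # fmt chunk starts with: format tag, channels, sample rate, byte rate
            _, _, _, byte_rate = struct.unpack_from("<HHII", raw, body)
        elif chunk_id == b"data":
            return size, len(raw) - body, byte_rate
        # Skip other chunks (e.g. LIST); chunks are padded to an even length.
        pos = body + size + (size & 1)
    raise ValueError("no data chunk found")


def describe(path: str) -> str:
    """Format the claimed and actual durations (in seconds) for one file."""
    with open(path, "rb") as f:
        claimed, actual, byte_rate = inspect_wav(f.read())
    return (f"{path}: header claims {claimed / byte_rate:.2f} s, "
            f"data holds {actual / byte_rate:.2f} s")
```

Calling `describe("test_bad.wav")` and `describe("test_ok.wav")` should show whether the two files disagree only in the declared size; for a file with a correct header the two numbers match.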
Workaround: save the files as WAV first.
Any ideas on how to make it work without saving the WAV files first?
Top GitHub Comments
FYI: I have managed to run the first two stages successfully with the workaround from https://github.com/kaldi-asr/kaldi/pull/4467/files.
Feel free to close this issue. I hit problems with LM training in stage 3, so I moved to the ESPnet2 recipe in egs2/commonvoice/asr1 as you suggested. Thank you! I then hit another issue, which I reported here: https://github.com/espnet/espnet/issues/3042
Good, this is better for us, thank you!
Potentially yes. Maybe ffmpeg doesn't provide such an option; I'm not sure. Another idea is to write a reading tool that does it. This is not difficult.
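One possible shape for such a reading tool, as a rough sketch rather than anything that exists in ESPnet or Kaldi: read the WAV from the pipe, ignore the size declared in the header, and count the audio bytes until the stream ends. The names `read_exact`, `stream_wav_duration`, and `ffmpeg_pipe_duration` are hypothetical. The sketch assumes uncompressed PCM and that nothing follows the `data` chunk; the ffmpeg arguments are copied from the commands in the issue.

```python
import struct
import subprocess


def read_exact(stream, n):
    """Read exactly n bytes from a binary stream (pipes may return short reads)."""
    buf = b""
    while len(buf) < n:
        part = stream.read(n - len(buf))
        if not part:
            raise EOFError("stream ended inside the WAV header")
        buf += part
    return buf


def stream_wav_duration(stream, block_size=65536):
    """Return the duration in seconds of a PCM WAV read from a non-seekable stream.

    The size declared in the data chunk is ignored on purpose; the audio
    bytes are counted until end of stream instead.
    """
    riff, _, wave = struct.unpack("<4sI4s", read_exact(stream, 12))
    if riff != b"RIFF" or wave != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    byte_rate = None
    while True:
        chunk_id, size = struct.unpack("<4sI", read_exact(stream, 8))
        if chunk_id == b"data":
            break
        # Consume the chunk body, including the pad byte for odd sizes.
        body = read_exact(stream, size + (size & 1))
        if chunk_id == b"fmt ":
            byte_rate = struct.unpack_from("<HHII", body)[3]
    if not byte_rate:
        raise ValueError("fmt chunk missing before data chunk")
    total = 0
    while True:
        part = stream.read(block_size)
        if not part:
            break
        total += len(part)
    return total / byte_rate


def ffmpeg_pipe_duration(mp3_path):
    """Decode with ffmpeg to stdout (same flags as in the issue) and measure it."""
    cmd = ["ffmpeg", "-i", mp3_path, "-ar", "16000", "-acodec", "pcm_s16le",
           "-ac", "1", "-f", "wav", "-"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL) as proc:
        return stream_wav_duration(proc.stdout)
```

This only measures the duration; it decodes the whole file to do so, which is the cost of not trusting the header. A tool along these lines could also rewrite the header with the counted size before handing the audio on, but that is not shown here.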