error when dumping gigaspeech XL feature
Hi Piotr (@pzelasko), I am trying to dump the GigaSpeech XL subset features with https://github.com/k2-fsa/icefall/pull/100. After 3.4 GB of features had been extracted, the process terminated unexpectedly.
Question 1: Should I configure something to fix this error?
Question 2: Is it possible to resume the extraction? I.e., now that 3.4 GB of features are already extracted, how can I avoid re-extracting the features for that part of the utterances?
Question 3: How big will feats_gigaspeech_XL.h5 be if everything goes fine? Around 1 TB (since the 10,000 h XL subset = 1000 × the 10 h XS subset, and feats_gigaspeech_XS.h5 is 960 MB)?
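A quick back-of-the-envelope check of the ~1 TB estimate, assuming feature storage grows roughly linearly with hours of audio (the XS numbers come from the file listing below):

```python
# Rough size estimate for feats_gigaspeech_XL.h5, assuming storage
# scales linearly with audio duration (XS: 10 h -> 960 MB).
xs_hours, xs_size_mb = 10, 960
xl_hours = 10_000
est_xl_gb = xs_size_mb * (xl_hours / xs_hours) / 1024
print(f"estimated XL feature size: ~{est_xl_gb:.0f} GB")  # just under 1 TB
```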
descriptions of related files:
-rw-r--r-- 1 guoliyong guoliyong 1.1M Nov 4 13:01 gigaspeech_cuts_XS_raw.jsonl.gz
-rw-r--r-- 1 guoliyong guoliyong 960M Nov 4 13:23 feats_gigaspeech_XS.h5
-rw-r--r-- 1 guoliyong guoliyong 998M Nov 4 17:32 gigaspeech_cuts_XL_raw.jsonl.gz
-rw-r--r-- 1 guoliyong guoliyong 3.4G Nov 5 07:19 feats_gigaspeech_XL.h5
Here is the log:
Filtering OOV utterances from supervisions
Normalizing text in XL
Processing XL
About to split XL raw cuts into smaller chunks.
Computing features in batches: 0%|_ | 90663/24848964 [13:18:26<3634:01:08, 1.89it/s]
Traceback (most recent call last):
File "prepare_gigaspeech.py", line 311, in <module>
main()
File "prepare_gigaspeech.py", line 267, in main
cut_set = cut_set.compute_and_store_features_batch(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/cut.py", line 3427, in compute_and_store_features_batch
for batch in dloader:
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1179, in _next_data
return self._process_data(data)
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 3.
Original Traceback (most recent call last):
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 46, in fetch
data = self.dataset[possibly_batched_index]
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/dataset/unsupervised.py", line 70, in __getitem__
return {"cuts": cuts, "audio": [c.load_audio() for c in cuts]}
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/dataset/unsupervised.py", line 70, in <listcomp>
return {"cuts": cuts, "audio": [c.load_audio() for c in cuts]}
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/cut.py", line 844, in load_audio
return self.recording.load_audio(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 345, in load_audio
samples = source.load_audio(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 128, in load_audio
samples, sampling_rate = read_audio(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/caching.py", line 70, in wrapper
return m(*args, **kwargs)
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 948, in read_audio
return read_opus(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 1336, in read_opus
channel_string = parse_channel_from_ffmpeg_output(proc.stderr)
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 1369, in parse_channel_from_ffmpeg_output
raise ValueError(
ValueError: Could not determine the number of channels for OPUS file from the following ffmpeg output (shown as bytestring due to avoid possible encoding issues):
b''
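For context, the failure happens because ffmpeg produced no stderr output at all (b''), so there is no "mono"/"stereo"/"N channels" marker to parse. Here is an illustrative re-implementation of that kind of channel parsing (not lhotse's exact code), which makes the failure mode visible:

```python
import re

def parse_channels(ffmpeg_stderr: bytes) -> int:
    """Parse the channel count from ffmpeg's stream description, e.g.
    'Audio: opus, 48000 Hz, stereo, fltp'. Illustrative sketch only."""
    text = ffmpeg_stderr.decode("utf-8", errors="replace")
    match = re.search(r"Audio:.*?(mono|stereo|(\d+)\s+channels)", text)
    if match is None:
        # Empty or unparseable stderr (e.g. ffmpeg crashed or the file
        # is unreadable) ends up here -- the same situation as the
        # ValueError in the traceback above.
        raise ValueError(f"Could not determine the number of channels: {ffmpeg_stderr!r}")
    if match.group(1) == "mono":
        return 1
    if match.group(1) == "stereo":
        return 2
    return int(match.group(2))
```

Since ffmpeg printed nothing at all here, the file itself is suspect: running ffmpeg manually on the failing .opus file should show whether it is corrupted or unreadable.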
Issue Analytics
- Created: 2 years ago
- Comments: 28 (14 by maintainers)
Top GitHub Comments
This seems like it might be some error in manifests. I’ll try to add some extra exception handling in Lhotse later today to display the cut ID, so we can pinpoint which cut is creating the problem.
Not at the moment, but let me think about it – I might have an idea how to add it.
That’s true, it might end up being pretty large. I would advise either computing the features on the fly, or adding extra compression by controlling lilcom’s tick_size (the default is -5; I think -3 still works pretty much identically, but I didn’t test it much). You can pass the lilcom argument like this: … I think HDF5 is also not the most space-efficient storage for uneven-length arrays and variable-length byte sequences, but I don’t have a good idea for an alternative backend atm.
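To make that trade-off concrete, here is an illustrative sketch (not lilcom’s actual code), under the assumption that tick_size controls a quantization grid of 2 ** tick_size, so going from -5 to -3 coarsens the grid by 4x in exchange for smaller files:

```python
# Illustrative only: the accuracy/size trade-off behind a tick_size-style
# parameter, assuming values are rounded to multiples of 2 ** tick_size
# (so -3 gives a coarser grid than the default -5).
def quantize(value: float, tick_size: int) -> float:
    tick = 2.0 ** tick_size
    return round(value / tick) * tick

for tick_size in (-5, -3):
    err = abs(quantize(0.3, tick_size) - 0.3)
    print(f"tick_size={tick_size}: grid step {2.0 ** tick_size}, error on 0.3 = {err:.4f}")
```

The coarser grid loses a little precision per value but compresses better; the maintainer’s point is that for log-mel features the difference is barely noticeable in practice.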
@glynpu one more thing I noticed: you’re not using kaldifeat, so you’re not getting any speed benefit out of batch feature extraction. To leverage kaldifeat, replace the following line: … with: …
Note the “Killed” message — something / somebody is killing your tasks. Maybe you’re running OOM on the node or using more resources than allocated? You can try decreasing batch size or num jobs, or ask around other users of your system for help.
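If it is the kernel’s OOM killer doing the killing, the kernel log will usually record it. A small sketch of scanning that log for the telltale messages (find_oom_lines is a hypothetical helper, and reading dmesg requires appropriate permissions):

```python
def find_oom_lines(log):
    """Return log lines that look like OOM-killer activity."""
    needles = ("killed process", "out of memory")
    return [line for line in log.splitlines()
            if any(n in line.lower() for n in needles)]

# To use it on a live machine (assumes `dmesg` is available and readable):
#   import subprocess
#   kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print("\n".join(find_oom_lines(kernel_log)))
```

If nothing shows up there and you are on a shared cluster, the job scheduler’s own logs may report a memory-limit breach instead.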