error when dumping gigaspeech XL feature
Hi Piotr (@pzelasko), I am trying to dump the GigaSpeech XL subset features with https://github.com/k2-fsa/icefall/pull/100. After 3.4 GB of features had been extracted, the process terminated unexpectedly.
Question 1: Should I configure something to fix this error?
Question 2: Is it possible to resume the extraction? I.e., now that 3.4 GB of features are already extracted, how can I avoid re-extracting the features for that part of the utterances?
Question 3: How big will feats_gigaspeech_XL.h5 be if everything goes fine? Around 1 TB (since the 10,000 h XL subset = 1000 × the 10 h XS subset, and feats_gigaspeech_XS.h5 is 960 MB)?
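A quick back-of-the-envelope check of the ~1 TB estimate, assuming feature storage grows roughly linearly with hours of audio (the XS numbers come from the file listing below):

```python
# Rough size estimate for feats_gigaspeech_XL.h5, assuming storage
# scales linearly with audio duration (XS: 10 h -> 960 MB).
xs_hours, xs_size_mb = 10, 960
xl_hours = 10_000
est_xl_gb = xs_size_mb * (xl_hours / xs_hours) / 1024
print(f"estimated XL feature size: ~{est_xl_gb:.0f} GB")  # just under 1 TB
```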
descriptions of related files:
-rw-r--r-- 1 guoliyong guoliyong 1.1M Nov 4 13:01 gigaspeech_cuts_XS_raw.jsonl.gz
-rw-r--r-- 1 guoliyong guoliyong 960M Nov 4 13:23 feats_gigaspeech_XS.h5
-rw-r--r-- 1 guoliyong guoliyong 998M Nov 4 17:32 gigaspeech_cuts_XL_raw.jsonl.gz
-rw-r--r-- 1 guoliyong guoliyong 3.4G Nov 5 07:19 feats_gigaspeech_XL.h5
Here is the log:
Filtering OOV utterances from supervisions
Normalizing text in XL
Processing XL
About to split XL raw cuts into smaller chunks.
Computing features in batches: 0%|_ | 90663/24848964 [13:18:26<3634:01:08, 1.89it/s]
Traceback (most recent call last):
File "prepare_gigaspeech.py", line 311, in <module>
main()
File "prepare_gigaspeech.py", line 267, in main
cut_set = cut_set.compute_and_store_features_batch(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/cut.py", line 3427, in compute_and_store_features_batch
for batch in dloader:
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
data = self._next_data()
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1179, in _next_data
return self._process_data(data)
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
data.reraise()
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 3.
Original Traceback (most recent call last):
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/ceph-ly/py38/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 46, in fetch
data = self.dataset[possibly_batched_index]
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/dataset/unsupervised.py", line 70, in __getitem__
return {"cuts": cuts, "audio": [c.load_audio() for c in cuts]}
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/dataset/unsupervised.py", line 70, in <listcomp>
return {"cuts": cuts, "audio": [c.load_audio() for c in cuts]}
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/cut.py", line 844, in load_audio
return self.recording.load_audio(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 345, in load_audio
samples = source.load_audio(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 128, in load_audio
samples, sampling_rate = read_audio(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/caching.py", line 70, in wrapper
return m(*args, **kwargs)
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 948, in read_audio
return read_opus(
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 1336, in read_opus
channel_string = parse_channel_from_ffmpeg_output(proc.stderr)
File "/ceph-ly/open-source/giga_librispeech/lhotse/lhotse/audio.py", line 1369, in parse_channel_from_ffmpeg_output
raise ValueError(
ValueError: Could not determine the number of channels for OPUS file from the following ffmpeg output (shown as bytestring due to avoid possible encoding issues):
b''
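For context, the failure happens because ffmpeg produced no stderr output at all (b''), so there is no "mono"/"stereo"/"N channels" marker to parse. Here is an illustrative re-implementation of that kind of channel parsing (not lhotse's exact code), which makes the failure mode visible:

```python
import re

def parse_channels(ffmpeg_stderr: bytes) -> int:
    """Parse the channel count from ffmpeg's stream description, e.g.
    'Audio: opus, 48000 Hz, stereo, fltp'. Illustrative sketch only."""
    text = ffmpeg_stderr.decode("utf-8", errors="replace")
    match = re.search(r"Audio:.*?(mono|stereo|(\d+)\s+channels)", text)
    if match is None:
        # Empty or unparseable stderr (e.g. ffmpeg crashed or the file
        # is unreadable) ends up here -- the same situation as the
        # ValueError in the traceback above.
        raise ValueError(f"Could not determine the number of channels: {ffmpeg_stderr!r}")
    if match.group(1) == "mono":
        return 1
    if match.group(1) == "stereo":
        return 2
    return int(match.group(2))
```

Since ffmpeg printed nothing at all here, the file itself is suspect: running ffmpeg manually on the failing .opus file should show whether it is corrupted or unreadable.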
Issue Analytics
- Created: 2 years ago
- Comments: 28 (14 by maintainers)
Top GitHub Comments
This seems like it might be some error in manifests. I’ll try to add some extra exception handling in Lhotse later today to display the cut ID, so we can pinpoint which cut is creating the problem.
Not at the moment, but let me think about it – I might have an idea how to add it.
That’s true, it might end up being pretty large. I would advise either computing the features on the fly, or adding extra compression by controlling lilcom’s tick_size (the default is -5; I think -3 still works pretty much identically, but I didn’t test it much). You can pass the lilcom argument like this: … I think HDF5 is also not the most space-efficient storage for uneven-length arrays and variable-length byte sequences, but I don’t have a good idea for an alternative backend atm.
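To make that trade-off concrete, here is an illustrative sketch (not lilcom’s actual code), under the assumption that tick_size controls a quantization grid of 2 ** tick_size, so going from -5 to -3 coarsens the grid by 4x in exchange for smaller files:

```python
# Illustrative only: the accuracy/size trade-off behind a tick_size-style
# parameter, assuming values are rounded to multiples of 2 ** tick_size
# (so -3 gives a coarser grid than the default -5).
def quantize(value: float, tick_size: int) -> float:
    tick = 2.0 ** tick_size
    return round(value / tick) * tick

for tick_size in (-5, -3):
    err = abs(quantize(0.3, tick_size) - 0.3)
    print(f"tick_size={tick_size}: grid step {2.0 ** tick_size}, error on 0.3 = {err:.4f}")
```

The coarser grid loses a little precision per value but compresses better; the maintainer’s point is that for log-mel features the difference is barely noticeable in practice.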
@glynpu one more thing I noticed: you’re not using kaldifeat, so you’re not getting any speed benefit out of batch feature extraction. To leverage kaldifeat, replace the following line: … with: …
Note the “Killed” message — something / somebody is killing your tasks. Maybe you’re running OOM on the node or using more resources than allocated? You can try decreasing batch size or num jobs, or ask around other users of your system for help.
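If it is the kernel’s OOM killer doing the killing, the kernel log will usually record it. A small sketch of scanning that log for the telltale messages (find_oom_lines is a hypothetical helper, and reading dmesg requires appropriate permissions):

```python
def find_oom_lines(log):
    """Return log lines that look like OOM-killer activity."""
    needles = ("killed process", "out of memory")
    return [line for line in log.splitlines()
            if any(n in line.lower() for n in needles)]

# To use it on a live machine (assumes `dmesg` is available and readable):
#   import subprocess
#   kernel_log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
#   print("\n".join(find_oom_lines(kernel_log)))
```

If nothing shows up there and you are on a shared cluster, the job scheduler’s own logs may report a memory-limit breach instead.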