Some problems when loading the TedLium3 dataset for transducer-stateless training

See original GitHub issue

Currently, I am trying to build a transducer-stateless recipe based on TED-LIUM 3 for icefall; this is the PR: https://github.com/k2-fsa/icefall/pull/183. The PR contains the concrete code for processing and loading the TED-LIUM dataset. In train.py, we also use the function remove_short_and_long_utt for filtering.
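For reference, remove_short_and_long_utt is essentially a duration predicate passed to the cut set's filter. A minimal sketch (the 1.0-20.0 s bounds here are illustrative; the recipe's actual thresholds may differ, and the Cut class below is a stand-in, not lhotse's):

```python
from dataclasses import dataclass

# Minimal stand-in for a lhotse Cut; only the duration field matters here.
@dataclass
class Cut:
    id: str
    duration: float  # seconds

def remove_short_and_long_utt(c: Cut) -> bool:
    # Keep only utterances whose duration falls inside the window.
    # The 1.0-20.0 second bounds are illustrative, not the recipe's exact values.
    return 1.0 <= c.duration <= 20.0

cuts = [Cut("a", 0.4), Cut("b", 7.5), Cut("c", 1800.0)]  # 1800 s = a whole talk
kept = [c for c in cuts if remove_short_and_long_utt(c)]
# Only the 7.5 s cut survives; an untrimmed 30-minute cut is always dropped,
# which explains why the filter can remove 100% of the cuts below.
```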

When I use the following code to load the data in train.py:

tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()

There is an error:

2022-01-26 17:06:37,134 INFO [train.py:577] About to create model
2022-01-26 17:06:37,804 INFO [train.py:581] Number of model parameters: 84007924
2022-01-26 17:06:41,550 INFO [asr_datamodule.py:341] About to get train cuts
2022-01-26 17:06:53,750 INFO [train.py:618] Before removing short and long utterances: 7053
2022-01-26 17:06:53,751 INFO [train.py:619] After removing short and long utterances: 0
2022-01-26 17:06:53,751 INFO [train.py:620] Removed 7053 utterances (100.00000%)
2022-01-26 17:06:53,751 INFO [asr_datamodule.py:176] About to get Musan cuts
2022-01-26 17:06:55,585 INFO [asr_datamodule.py:183] Enable MUSAN
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:208] Enable SpecAugment
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:209] Time warp factor: 80
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:224] About to create train dataset
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:252] Using BucketingSampler.
Traceback (most recent call last):
  File "transducer_stateless/train.py", line 733, in <module>
    main()
  File "transducer_stateless/train.py", line 726, in main
    run(rank=0, world_size=1, args=args)
  File "transducer_stateless/train.py", line 622, in run
    train_dl = tedlium.train_dataloaders(train_cuts)
  File "/ceph-meixu/luomingshuang/icefall/egs/tedlium3/ASR/transducer_stateless/asr_datamodule.py", line 253, in train_dataloaders
    train_sampler = BucketingSampler(
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 108, in __init__
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 392, in create_buckets_equal_duration
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 430, in _create_buckets_equal_duration_single
IndexError: pop from empty list

So I read cuts_train.json.gz with vim data/fbank/cuts_train.json.gz, which shows as follows: [screenshot of the manifest omitted]

As shown in the manifest above, each cut’s duration is too long, so all the samples are filtered out.

To fix this issue, I tried the following code for loading the data:

tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions()

There is an error:

2022-01-26 17:17:15,419 INFO [train.py:577] About to create model
2022-01-26 17:17:16,063 INFO [train.py:581] Number of model parameters: 84007924
2022-01-26 17:17:19,781 INFO [asr_datamodule.py:341] About to get train cuts
2022-01-26 17:18:35,665 INFO [train.py:618] Before removing short and long utterances: 804789
2022-01-26 17:18:35,665 INFO [train.py:619] After removing short and long utterances: 801989
2022-01-26 17:18:35,665 INFO [train.py:620] Removed 2800 utterances (0.34792%)
2022-01-26 17:18:35,665 INFO [asr_datamodule.py:176] About to get Musan cuts
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:183] Enable MUSAN
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:208] Enable SpecAugment
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:209] Time warp factor: 80
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:224] About to create train dataset
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:252] Using BucketingSampler.
2022-01-26 17:18:41,597 INFO [asr_datamodule.py:268] About to create train dataloader
2022-01-26 17:18:41,597 INFO [asr_datamodule.py:348] About to get dev cuts
2022-01-26 17:18:41,695 INFO [asr_datamodule.py:289] About to create dev dataset
2022-01-26 17:18:41,696 INFO [asr_datamodule.py:308] About to create dev dataloader
2022-01-26 17:18:41,697 INFO [train.py:685] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
Traceback (most recent call last):
  File "transducer_stateless/train.py", line 733, in <module>
    main()
  File "transducer_stateless/train.py", line 726, in main
    run(rank=0, world_size=1, args=args)
  File "transducer_stateless/train.py", line 628, in run
    scan_pessimistic_batches_for_oom(
  File "transducer_stateless/train.py", line 690, in scan_pessimistic_batches_for_oom
    batch = train_dl.dataset[cuts]
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/speech_recognition.py", line 99, in __getitem__
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/speech_recognition.py", line 206, in validate_for_asr
AssertionError: Supervisions starting before the cut are not supported for ASR (sup id: ClayShirky_2005G-126, cut id: 7840abc4-ea04-003f-4314-ee1381d764dd)

As the error above shows, some supervisions’ start times relative to their cuts may be negative (< 0).
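A simplified sketch of the kind of check that raises this error (not the actual lhotse code, just the idea that every supervision must lie fully inside its cut):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Supervision:
    id: str
    start: float    # seconds, relative to the start of the cut
    duration: float

@dataclass
class Cut:
    id: str
    duration: float
    supervisions: List[Supervision]

def validate_for_asr(cut: Cut) -> None:
    # Every supervision must begin at or after the cut start
    # and end at or before the cut end.
    for sup in cut.supervisions:
        assert sup.start >= 0, (
            f"Supervisions starting before the cut are not supported for ASR "
            f"(sup id: {sup.id}, cut id: {cut.id})"
        )
        assert sup.start + sup.duration <= cut.duration + 1e-3, (
            f"Supervision ends after the cut (sup id: {sup.id}, cut id: {cut.id})"
        )

good = Cut("c1", 5.0, [Supervision("s1", 0.0, 5.0)])
validate_for_asr(good)  # passes

bad = Cut("c2", 5.0, [Supervision("s2", -0.2, 3.0)])
# validate_for_asr(bad) would raise AssertionError: the supervision starts
# 0.2 s before the cut, just like ClayShirky_2005G-126 in the log above.
```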

To fix this issue, I tried the following code for loading the data:

tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)

The data loads normally with the above code, but it takes a long time to generate a batch on the GPU, and the volatile GPU-Util stays at 0 for long stretches.

So I am trying to split the long cuts into short cuts before computing the fbank features (in compute_fbank_tedlium.py):

            cut_set = CutSet.from_manifests(
                recordings=m["recordings"],
                supervisions=m["supervisions"],
            ).trim_to_supervisions(keep_overlapping=False)

Is there any advice for this issue? Thanks!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 11 (1 by maintainers)

Top GitHub Comments

1 reaction
danpovey commented, Jan 27, 2022

Oh I’m sorry, I see now that I misinterpreted a figure, and that the cuts do overlap after all.

1 reaction
pzelasko commented, Jan 27, 2022

Sure.

  • what was going on here initially (why the error “Supervisions starting before the cut are not supported for ASR” happened)

When he ran .trim_to_supervisions(), it created one cut per supervision present in the CutSet. By default, each of these cuts contains every supervision that overlaps it by at least 1% of its duration. This is to make sure that users are aware there is possibly overlapping speech in the data; they can either filter these cuts out or pass the flag keep_overlapping=False, in which case there is exactly one supervision per cut. I opted not to make the latter the default, as it could be disastrous with corpora where overlap is common.
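The default behaviour can be sketched with plain intervals (a toy illustration of the overlap rule, not lhotse’s implementation):

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    # Length of the intersection of two intervals (0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def supervisions_in_trimmed_cut(target, all_sups, min_fraction=0.01):
    """For a cut trimmed to `target` = (start, end), keep every supervision
    that overlaps it by at least `min_fraction` of the supervision's duration.
    This mimics the default keep_overlapping=True behaviour; with
    keep_overlapping=False, only the target supervision itself would remain."""
    start, end = target
    kept = []
    for s_start, s_end in all_sups:
        ov = overlap_seconds(start, end, s_start, s_end)
        if ov >= min_fraction * (s_end - s_start):
            kept.append((s_start, s_end))
    return kept

sups = [(0.0, 4.0), (3.9, 8.0), (20.0, 25.0)]
# Trimming to the first supervision also pulls in the second one,
# because they overlap by 0.1 s, which is >= 1% of the second one's 4.1 s.
print(supervisions_in_trimmed_cut((0.0, 4.0), sups))
```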

  • Why it was slow when he did:

tedlium = TedLiumAsrDataModule(args)

train_cuts = tedlium.train_cuts()

train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)

This was slow because the result of tedlium.train_cuts() contains long cuts (30min?) with a lot of supervisions. The current implementation of trim_to_supervisions creates an interval tree of supervisions for each cut to “quickly” determine which ones are overlapping. Quite possibly it’s not the fastest implementation we can get, but at least it’s not quadratic. There might be some overhead from creating a lot of Python objects too, I’m not sure without a profile.
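The idea can be approximated with the standard library (lhotse uses a real interval tree; this bisect-based sketch only shows why a sorted index beats a quadratic pairwise scan):

```python
from bisect import bisect_left

def build_index(sups):
    # Sort supervisions by start time once per cut, O(n log n).
    return sorted(sups)

def overlapping(index, q_start, q_end):
    """Find supervisions overlapping [q_start, q_end). With the list sorted
    by start time, a binary search locates the first supervision that starts
    too late, so everything past it is skipped; a real interval tree also
    prunes on the left side, which this sketch does with a linear filter."""
    starts = [s for s, _ in index]
    hi = bisect_left(starts, q_end)  # everything from `hi` on starts too late
    return [(s, e) for s, e in index[:hi] if e > q_start]

index = build_index([(0.0, 4.0), (3.9, 8.0), (20.0, 25.0)])
print(overlapping(index, 3.0, 5.0))  # the two overlapping supervisions
```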

  • and why it helped when he did trim_to_supervisions(keep_overlapping=False) before computing fbank

It shifted the “cost” of trimming to supervisions to an earlier stage, so that when he runs the training scripts, he simply reads “precomputed trims” of cuts.

  • and why we need “keep_overlapping=False” in the Tedlium setup. I would have thought overlapping supervisions would be rare since it is only one speaker, and it would be harmless to keep small overlaps.

It’s not so easy: if an overlapping supervision extends outside of the cut, we are missing a part of the audio that may correspond to some text, so we’d be introducing bad training examples. This could be fixed by extending the cut to cover the full overlapping supervision (I don’t think we have a method for this yet).
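Such an extension would just be an interval union; a sketch of the idea (not an existing lhotse API):

```python
def extend_cut_to_cover(cut_start, cut_end, sups):
    """Grow a cut's boundaries so that every overlapping supervision fits
    fully inside it. Purely a sketch of the idea discussed above; no such
    method existed in lhotse at the time of this discussion."""
    for s_start, s_end in sups:
        # Only supervisions that actually overlap the cut matter.
        if s_end > cut_start and s_start < cut_end:
            cut_start = min(cut_start, s_start)
            cut_end = max(cut_end, s_end)
    return cut_start, cut_end

# A cut [2.0, 6.0] with supervisions spilling out on both sides
# grows to cover them; a non-overlapping supervision is ignored.
print(extend_cut_to_cover(2.0, 6.0, [(1.5, 3.0), (5.0, 7.0)]))  # (1.5, 7.0)
```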

Unfortunately, all it takes is one bad cut to get into these issues, unless we check for these things explicitly in the data prep scripts rather than in K2SpeechRecognitionDataset.
