Some problems when loading the TedLium3 dataset for transducer-stateless training
Currently, I am trying to build a transducer-stateless recipe based on TedLium3 for icefall; this is the PR: https://github.com/k2-fsa/icefall/pull/183. The PR contains the concrete code for processing and loading the TedLium3 dataset. In train.py, we also use the function remove_short_and_long_utt for filtering.
When I use the following code to load the data in train.py:
tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
I get this error:
2022-01-26 17:06:37,134 INFO [train.py:577] About to create model
2022-01-26 17:06:37,804 INFO [train.py:581] Number of model parameters: 84007924
2022-01-26 17:06:41,550 INFO [asr_datamodule.py:341] About to get train cuts
2022-01-26 17:06:53,750 INFO [train.py:618] Before removing short and long utterances: 7053
2022-01-26 17:06:53,751 INFO [train.py:619] After removing short and long utterances: 0
2022-01-26 17:06:53,751 INFO [train.py:620] Removed 7053 utterances (100.00000%)
2022-01-26 17:06:53,751 INFO [asr_datamodule.py:176] About to get Musan cuts
2022-01-26 17:06:55,585 INFO [asr_datamodule.py:183] Enable MUSAN
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:208] Enable SpecAugment
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:209] Time warp factor: 80
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:224] About to create train dataset
2022-01-26 17:06:55,586 INFO [asr_datamodule.py:252] Using BucketingSampler.
Traceback (most recent call last):
  File "transducer_stateless/train.py", line 733, in <module>
    main()
  File "transducer_stateless/train.py", line 726, in main
    run(rank=0, world_size=1, args=args)
  File "transducer_stateless/train.py", line 622, in run
    train_dl = tedlium.train_dataloaders(train_cuts)
  File "/ceph-meixu/luomingshuang/icefall/egs/tedlium3/ASR/transducer_stateless/asr_datamodule.py", line 253, in train_dataloaders
    train_sampler = BucketingSampler(
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 108, in __init__
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 392, in create_buckets_equal_duration
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/sampling/bucketing.py", line 430, in _create_buckets_equal_duration_single
IndexError: pop from empty list
So I read the manifest with vim data/fbank/cuts_train.json.gz. It shows that each cut's duration is very long (each cut covers a whole talk), so all the samples are filtered out by remove_short_and_long_utt.
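For reference, the filter in train.py looks roughly like this sketch (the exact duration bounds are an assumption here, borrowed from the usual icefall convention):

def remove_short_and_long_utt(c):
    # Keep only utterances whose duration lies in a sane range for
    # training, e.g. [1, 20] seconds. Whole-talk TedLium3 cuts are
    # many minutes long, so every one of them fails this check and
    # the resulting CutSet is empty.
    return 1.0 <= c.duration <= 20.0

train_cuts = train_cuts.filter(remove_short_and_long_utt)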
To fix this issue, I tried the following code for loading the data:
tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions()
This raises another error:
2022-01-26 17:17:15,419 INFO [train.py:577] About to create model
2022-01-26 17:17:16,063 INFO [train.py:581] Number of model parameters: 84007924
2022-01-26 17:17:19,781 INFO [asr_datamodule.py:341] About to get train cuts
2022-01-26 17:18:35,665 INFO [train.py:618] Before removing short and long utterances: 804789
2022-01-26 17:18:35,665 INFO [train.py:619] After removing short and long utterances: 801989
2022-01-26 17:18:35,665 INFO [train.py:620] Removed 2800 utterances (0.34792%)
2022-01-26 17:18:35,665 INFO [asr_datamodule.py:176] About to get Musan cuts
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:183] Enable MUSAN
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:208] Enable SpecAugment
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:209] Time warp factor: 80
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:224] About to create train dataset
2022-01-26 17:18:38,257 INFO [asr_datamodule.py:252] Using BucketingSampler.
2022-01-26 17:18:41,597 INFO [asr_datamodule.py:268] About to create train dataloader
2022-01-26 17:18:41,597 INFO [asr_datamodule.py:348] About to get dev cuts
2022-01-26 17:18:41,695 INFO [asr_datamodule.py:289] About to create dev dataset
2022-01-26 17:18:41,696 INFO [asr_datamodule.py:308] About to create dev dataloader
2022-01-26 17:18:41,697 INFO [train.py:685] Sanity check -- see if any of the batches in epoch 0 would cause OOM.
Traceback (most recent call last):
  File "transducer_stateless/train.py", line 733, in <module>
    main()
  File "transducer_stateless/train.py", line 726, in main
    run(rank=0, world_size=1, args=args)
  File "transducer_stateless/train.py", line 628, in run
    scan_pessimistic_batches_for_oom(
  File "transducer_stateless/train.py", line 690, in scan_pessimistic_batches_for_oom
    batch = train_dl.dataset[cuts]
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/speech_recognition.py", line 99, in __getitem__
  File "/ceph-meixu/luomingshuang/anaconda3/envs/k2-python/lib/python3.8/site-packages/lhotse-1.0.0.dev0+git.6a3192a.clean-py3.8.egg/lhotse/dataset/speech_recognition.py", line 206, in validate_for_asr
AssertionError: Supervisions starting before the cut are not supported for ASR (sup id: ClayShirky_2005G-126, cut id: 7840abc4-ea04-003f-4314-ee1381d764dd)
It turns out that some supervisions start before their cut, i.e. the supervision's start time within the cut is negative (< 0).
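To confirm which cuts are affected, a quick diagnostic along these lines can list them (not part of the recipe):

# Supervision times inside a cut are relative to the cut start, so
# s.start < 0 means the supervision begins before the cut, which is
# exactly what validate_for_asr complains about.
bad_cuts = train_cuts.filter(
    lambda c: any(s.start < 0 for s in c.supervisions)
)
for cut in bad_cuts:
    print(cut.id, [(s.id, round(s.start, 2)) for s in cut.supervisions])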
To fix that, I then tried the following code for loading the data:
tedlium = TedLiumAsrDataModule(args)
train_cuts = tedlium.train_cuts()
train_cuts = train_cuts.trim_to_supervisions(keep_overlapping=False)
The data loads normally with this code, but it takes a long time to generate each batch, and the volatile GPU-Util stays at 0 for long stretches.
So now I am trying to split the long cuts into short ones before computing the fbank features, in compute_fbank_tedlium.py:
cut_set = CutSet.from_manifests(
    recordings=m["recordings"],
    supervisions=m["supervisions"],
).trim_to_supervisions(keep_overlapping=False)
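The trimmed cuts then go through feature extraction and get saved, roughly like this (the extractor config, storage path, and num_jobs below are illustrative, not necessarily the exact values in compute_fbank_tedlium.py):

from lhotse import Fbank, FbankConfig

# Compute fbank features for the (now short) per-supervision cuts;
# the resulting manifest replaces the whole-talk cuts, so train.py
# simply reads utterance-sized cuts with precomputed features.
cut_set = cut_set.compute_and_store_features(
    extractor=Fbank(FbankConfig(num_mel_bins=80)),
    storage_path="data/fbank/feats_train",
    num_jobs=15,
)
cut_set.to_json("data/fbank/cuts_train.json.gz")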
Is there any advice on this issue? Thanks!
Top GitHub Comments
Oh I’m sorry, I see now that I misinterpreted a figure, and that the cuts do overlap after all.
Sure. When he ran .trim_to_supervisions(), it created one cut per supervision present in the CutSet. By default, these cuts will contain all supervisions that overlap with them by at least 1% of their duration. This is to make sure that users are aware there is possibly overlapping speech in the data; they may either filter these cuts out, or use the flag keep_overlapping=False, in which case there will be only one supervision per cut. I opted not to make the latter case the default, as it could be disastrous with corpora where overlap is common.

This was slow because the result of tedlium.train_cuts() contains long cuts (30 min?) with a lot of supervisions. The current implementation of trim_to_supervisions creates an interval tree of supervisions for each cut to "quickly" determine which ones are overlapping. Quite possibly it's not the fastest implementation we can get, but at least it's not quadratic. There might be some overhead from creating a lot of Python objects too; I'm not sure without a profile.

Doing the trimming in compute_fbank_tedlium.py shifted the "cost" of trimming to supervisions to an earlier stage, so that when he runs the training scripts, he simply reads "precomputed trims" of cuts.
It's not so easy: if an overlapping supervision goes "outside" of the cut, we are missing a part of the audio that may correspond to some text, so we'd be introducing bad training examples. This can be fixed by extending the cut to cover the full overlapping supervision (I don't think we have a method for this yet).
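Since no such method exists yet, here is a purely hypothetical sketch of the idea; extend_to_cover_supervisions is an invented name, not a lhotse API, and it assumes a MonoCut whose supervision times are relative to the cut start:

from lhotse.utils import fastcopy

def extend_to_cover_supervisions(cut):
    # Hypothetical helper: grow the cut so that every overlapping
    # supervision fits fully inside it.
    if not cut.supervisions:
        return cut
    left = min(0.0, min(s.start for s in cut.supervisions))
    right = max(cut.duration, max(s.end for s in cut.supervisions))
    new_start = max(0.0, cut.start + left)
    shift = cut.start - new_start  # how far the left edge moved
    new_duration = min(right + shift, cut.recording.duration - new_start)
    return fastcopy(
        cut,
        start=new_start,
        duration=new_duration,
        # Supervision times are relative to the cut start, so shift them.
        supervisions=[
            fastcopy(s, start=s.start + shift) for s in cut.supervisions
        ],
    )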
Unfortunately, all it takes is one bad cut to get into these issues, unless we check for these things explicitly in the data prep scripts rather than in K2SpeechRecognitionDataset.