Training gets stuck in DynamicBucketingSampler
I am refactoring icefall to use lazy CutSets with the dynamic bucketing sampler everywhere in https://github.com/k2-fsa/icefall/pull/397.
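Dynamic bucketing works by streaming cuts from a lazily read CutSet, routing each cut to a bucket of similar durations, and emitting a batch from a bucket once it holds about `max_duration` seconds of audio. Below is a simplified, hypothetical sketch of that idea in plain Python. It is not lhotse's implementation: lhotse's DynamicBucketingSampler is more involved (for example, it shuffles and estimates bucket boundaries from the data). The `Cut` class, the `dynamic_bucketing` function, and the hand-picked `bucket_edges` are illustrative assumptions.

```python
import bisect
import random
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Cut:
    # Hypothetical stand-in for a lhotse Cut: only an id and a duration (seconds).
    id: str
    duration: float


def dynamic_bucketing(
    cuts: Iterable[Cut], bucket_edges: List[float], max_duration: float
) -> Iterator[List[Cut]]:
    """Group a lazily consumed stream of cuts into duration-homogeneous batches.

    A cut goes to bucket i when bucket_edges[i-1] <= duration < bucket_edges[i].
    A bucket is emitted as a batch once adding the next cut would push its total
    duration above max_duration; partial buckets are flushed at the end, so no
    cut is dropped.
    """
    buckets: List[List[Cut]] = [[] for _ in range(len(bucket_edges) + 1)]
    totals = [0.0] * len(buckets)
    for cut in cuts:  # pull one cut at a time; the stream is never materialized
        i = bisect.bisect_right(bucket_edges, cut.duration)
        if buckets[i] and totals[i] + cut.duration > max_duration:
            yield buckets[i]
            buckets[i], totals[i] = [], 0.0
        buckets[i].append(cut)
        totals[i] += cut.duration
    for bucket in buckets:  # flush remainders instead of discarding them
        if bucket:
            yield bucket


# Usage: a generator stands in for a lazily read CutSet.
random.seed(0)
lazy_cuts = (Cut(f"c{n}", random.uniform(1.0, 20.0)) for n in range(1000))
batches = list(
    dynamic_bucketing(lazy_cuts, bucket_edges=[5.0, 10.0, 15.0], max_duration=100.0)
)
print(f"{len(batches)} batches")
```

Flushing the partial buckets at the end is what keeps every cut in exactly one batch; a bucketing sampler that silently dropped those remainders would be one way to lose utterances.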
When I run the following command in the librispeech recipe directory using the above PR:
./pruned_transducer_stateless3/train.py \
--world-size 1 \
--num-epochs 30 \
--start-epoch 0 \
--exp-dir pruned_transducer_stateless3/exp \
--full-libri 0 \
--max-duration 100 \
--giga-prob 0.2
the training process seems to get stuck inside the dynamic bucketing sampler.
The log output is
I am using py-spy to find where it gets stuck:
watch -n 0.5 py-spy dump --pid 308949 --native
The output is in this screen recording:
https://user-images.githubusercontent.com/5284924/172058000-fa268eca-5139-4edd-982f-7c0de189bb55.mov
Ten minutes have passed, but the output has not changed.
P.S. I am using the latest master branch of Lhotse.
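When attaching py-spy is not convenient, Python's standard library offers a rough in-process alternative. The sketch below is an assumption-laden helper, not part of icefall or Lhotse: the hypothetical `dump_all_stacks` collects the Python stack of every thread via `sys._current_frames()` (Python frames only, unlike `py-spy dump --native`), and the hypothetical `arm_watchdog` wraps `faulthandler.dump_traceback_later`, which writes all threads' tracebacks after a timeout. Both only see the process they run in, so DataLoader worker processes would need their own calls.

```python
import faulthandler
import sys
import threading
import traceback


def dump_all_stacks() -> str:
    """Return the current Python stack of every thread in this process.

    Similar in spirit to `py-spy dump` (Python frames only, no native frames),
    but collected from inside the process via sys._current_frames().
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        header = f"--- Thread {names.get(thread_id, '?')} ({thread_id}) ---\n"
        chunks.append(header + "".join(traceback.format_stack(frame)))
    return "\n".join(chunks)


def arm_watchdog(timeout_s: float, file=sys.stderr) -> None:
    """(Re-)arm a timer that dumps all threads' tracebacks to `file`.

    A new call replaces the previous timer, so calling this once per training
    step means tracebacks are only written if no step completes within
    timeout_s seconds (and then every timeout_s seconds until re-armed or
    cancelled with faulthandler.cancel_dump_traceback_later()).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


if __name__ == "__main__":
    print(dump_all_stacks())
```

For a hang like the one above, calling `arm_watchdog(600)` before fetching each batch would, under these assumptions, print where the main process is blocked once ten minutes pass without progress.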
Issue analytics: created a year ago; 12 comments (5 by maintainers).
Top GitHub Comments
That was with an older Lhotse. With the latest version of Lhotse, I have verified that the number of frames does not depend on the number of workers; it is 944034.00. From the output of
lhotse cut describe data/fbank/cuts_dev-clean.json.gz
and the same for dev-other, the total duration of the validation set seems to be 5.4 + 5.1 hours; multiplying by 3600 seconds per hour and 25 frames per second gives 945000 frames. So the length seems correct; I suppose that before, the BucketingSampler may have been discarding some utterances.

That's possible; I recall merging two PRs fairly recently that fixed some data loss in BucketingSampler.
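The sanity check in the comment is simple arithmetic, sketched below in Python. It assumes 25 frames per second, the rate used in the comment (this would match 10 ms feature frames after 4x subsampling, which is an assumption here); `expected_frames` is a hypothetical helper. Because the hour figures are rounded to one decimal place, their sum can be off by up to 0.1 h, or 9000 frames, so the 966-frame difference is well within rounding error.

```python
def expected_frames(hours: float, frames_per_second: float = 25.0) -> float:
    """Frames in `hours` of audio at `frames_per_second` (3600 s per hour)."""
    return hours * 3600.0 * frames_per_second


# Durations quoted in the comment, rounded to 0.1 h: dev-clean 5.4 h, dev-other 5.1 h.
expected = expected_frames(5.4 + 5.1)
measured = 944034.0  # frame count reported in the comment
# Each figure may be off by up to 0.05 h from rounding, i.e. 0.1 h in total.
rounding_slack = expected_frames(0.1)
gap = expected - measured
print(f"expected={expected:.0f} measured={measured:.0f} gap={gap:.0f} slack={rounding_slack:.0f}")
```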