Training gets stuck in DynamicBucketingSampler

I am refactoring icefall to use lazy CutSets with the dynamic bucketing sampler everywhere, in https://github.com/k2-fsa/icefall/pull/397

For the following command run in the librispeech recipe directory using the above PR:

  ./pruned_transducer_stateless3/train.py \
    --world-size 1 \
    --num-epochs 30 \
    --start-epoch 0 \
    --exp-dir pruned_transducer_stateless3/exp \
    --full-libri 0 \
    --max-duration 100 \
    --giga-prob 0.2

the training process seems to get stuck inside the dynamic bucketing sampler.

The log output is in the attached screenshot (Screen Shot 2022-06-05 at 23 26 18).

I am using py-spy to find where it gets stuck:

watch -n 0.5 py-spy dump --pid 308949 --native

The output is

https://user-images.githubusercontent.com/5284924/172058000-fa268eca-5139-4edd-982f-7c0de189bb55.mov

10 minutes have passed but nothing changes.

PS: I am using the latest master of lhotse.

Top GitHub Comments

danpovey commented, Jun 6, 2022

That was with an older Lhotse. With the latest version of Lhotse, I have verified that the number of frames does not depend on the number of workers; it is 944034.00. According to lhotse cut describe data/fbank/cuts_dev-clean.json.gz (and the same for dev-other), the total duration of the validation set is 5.4 + 5.1 = 10.5 hours. Multiplying by 3600 seconds per hour and 25 frames per second gives 945000 frames, so the length seems correct. I suppose what may have been happening before is that the BucketingSampler was discarding some utterances.
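The sanity check in the comment above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the durations, the frame rate, and the observed frame count are copied from the comment (the durations are rounded to 0.1 hour), and the names `expected_frames` and `relative_gap` are made up for this illustration.

```python
# Values quoted in the comment above; the durations are rounded to 0.1 h.
DEV_CLEAN_HOURS = 5.4
DEV_OTHER_HOURS = 5.1
FRAMES_PER_SECOND = 25       # frame rate quoted in the comment
OBSERVED_FRAMES = 944034.0   # frame count reported with the latest Lhotse


def expected_frames(hours: float, fps: float) -> float:
    """Convert a duration in hours to a frame count at the given frame rate."""
    return hours * 3600 * fps


total_hours = DEV_CLEAN_HOURS + DEV_OTHER_HOURS
expected = expected_frames(total_hours, FRAMES_PER_SECOND)

# Relative difference between the estimate and the observed count.
relative_gap = (expected - OBSERVED_FRAMES) / expected

print(f"expected: {expected:.0f} frames")
print(f"observed: {OBSERVED_FRAMES:.0f} frames")
print(f"relative gap: {relative_gap:.3%}")
```

The estimate and the observed count differ by 966 frames, roughly 0.1%. Because each duration is rounded to a tenth of an hour, a gap of that size is within what rounding alone could explain, which is consistent with the conclusion that no utterances are being dropped.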

pzelasko commented, Jun 6, 2022

That’s possible; I recall merging two PRs fairly recently that fixed some data loss in BucketingSampler.