Sometimes batches are created that do not have the same number of supervisions and inputs
```
Traceback (most recent call last):
File "train.py", line 1019, in <module>
main()
File "train.py", line 1012, in main
run(rank=0, world_size=1, args=args)
File "train.py", line 867, in run
scan_pessimistic_batches_for_oom(
File "train.py", line 977, in scan_pessimistic_batches_for_oom
loss, _ = compute_loss(
File "train.py", line 542, in compute_loss
simple_loss, pruned_loss = model(
File "/home/rudolf/miniconda3/envs/k2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/ssd1/team/tmp/rudolf/icefall/ch/model.py", line 117, in forward
assert x.size(0) == x_lens.size(0) == y.dim0
AssertionError
```
I did call `trim_to_supervisions()`. I'm trying to track this down.
As an aside, I'm surprised by what I find to be an unusual design of the data loading (not using the `batch_sampler` argument of `DataLoader`, the sampler rather than the dataset holding the data, collation happening in the dataset's `__getitem__`, etc.).
edit: I think it's because I forgot `--discard-overlapping`
edit2: still failing; some cuts now don't have supervisions
This is how I created the cuts manifest:

```python
from lhotse import CutSet, Fbank, LilcomChunkyWriter
from lhotse.kaldi import load_kaldi_data_dir

# Import the Kaldi data dir (wav.scp, segments, text, ...) as Lhotse manifests.
recording_set, supervision_set, _ = load_kaldi_data_dir(kdata, 16000, num_jobs=4)
cuts = CutSet.from_manifests(recordings=recording_set, supervisions=supervision_set)
# One cut per supervision; drop supervisions overlapping the cut boundaries.
cuts = cuts.trim_to_supervisions(keep_overlapping=False)
cuts = cuts.truncate(offset_type='start', max_duration=60.0, keep_excessive_supervisions=False)
cuts = cuts.compute_and_store_features(Fbank(), storage_path=outf + '-feats', num_jobs=6, storage_type=LilcomChunkyWriter)
cuts.to_file(outf + '.jsonl.gz')
```
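To see which cuts ended up without supervisions (edit2), a quick filter over the result helps (a sketch using the standard `CutSet.filter` API):

```python
# List cuts whose supervision list came out empty after trimming/truncation.
for cut in cuts.filter(lambda c: len(c.supervisions) == 0):
    print(cut.id, cut.duration)
```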
edit3: it has to do with the duration of the recording being shorter than the duration of a supervision, presumably because of using `sox bla.mp3 .. |` in wav.scp.

edit4: Okay, so for some reason the sox resampling makes the file slightly shorter than expected. This means the interval taken from the Kaldi `segments` file is too long, so the cut gets dropped later when trimming. I'm not sure what the right fix for this is; my quickfix is capping the supervision duration so it does not exceed the recording, and throwing an error if the difference is bigger than 0.1 s.
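Roughly, the quickfix looks like this (a sketch of what I described, not the exact code; `clamp_supervisions` is my own helper name, and `dataclasses.replace` works because `SupervisionSegment` is a dataclass):

```python
from dataclasses import replace

from lhotse import SupervisionSet

def clamp_supervisions(supervisions, recordings, tolerance=0.1):
    """Cap each supervision at its recording's actual duration."""
    # Actual (post-sox) duration of each recording, keyed by recording id.
    durations = {r.id: r.duration for r in recordings}
    fixed = []
    for sup in supervisions:
        overshoot = sup.end - durations[sup.recording_id]
        if overshoot > tolerance:
            # The segments entry is way off; better to fail than silently fix.
            raise ValueError(
                f"{sup.id} exceeds recording {sup.recording_id} by {overshoot:.3f}s"
            )
        if overshoot > 0:
            # Shrink the supervision so it ends at the recording boundary.
            sup = replace(sup, duration=durations[sup.recording_id] - sup.start)
        fixed.append(sup)
    return SupervisionSet.from_segments(fixed)
```

I apply this to `supervision_set` before calling `CutSet.from_manifests`.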
Comments
Regarding the design, some motivation is provided here: https://lhotse.readthedocs.io/en/latest/datasets.html#about-lhotses-datasets-and-samplers
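For reference, the pattern described there: the sampler iterates over cut metadata and yields mini-batches of cuts, and the dataset's `__getitem__` collates such a `CutSet` into tensors, which is why the `DataLoader` gets `batch_size=None`. A minimal sketch (class names are from `lhotse.dataset`; the sampler options here are illustrative):

```python
from torch.utils.data import DataLoader
from lhotse.dataset import DynamicBucketingSampler, K2SpeechRecognitionDataset

# The sampler yields mini-batches of cuts (metadata only); the dataset turns
# each CutSet into collated feature/supervision tensors.
dataset = K2SpeechRecognitionDataset()
sampler = DynamicBucketingSampler(cuts, max_duration=200.0, shuffle=True)
dloader = DataLoader(dataset, sampler=sampler, batch_size=None, num_workers=2)
```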
As for Kaldi imports with a segments file, Sox pipes, and MP3s: I've found in the past that the duration information is quite unreliable, and usually some manual, data-specific step is required to fix it. If you have any suggestions for how we can improve this experience, they are welcome.
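One option, assuming your installed Lhotse version ships the QA helpers (this is an assumption about the version, so check that `lhotse.qa` provides them): they automate exactly the kind of trimming described in edit4:

```python
# Assumption: these helpers exist in your installed Lhotse (module lhotse.qa).
from lhotse.qa import fix_manifests

# Drops supervisions whose recording is missing and trims supervisions that
# extend past the recording's actual duration, returning the fixed pair.
recording_set, supervision_set = fix_manifests(recording_set, supervision_set)
```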
I will try that out!