
DynamicBatchSampler


Hi,

I’m trying to understand the implementation of DynamicBatchSampler. In _get_boundaries_through_warping, why do you use a lognorm distribution with s=1 to get the quantiles and then scale them linearly up to max_batch_length? What about using lognorm.fit instead?
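For concreteness, the two options look roughly like this (a sketch using scipy; the variable names and placeholder data are illustrative, not the speechbrain code):

```python
import numpy as np
from scipy.stats import lognorm

num_quantiles = 10
q = np.linspace(
    1 / num_quantiles, (num_quantiles - 1) / num_quantiles, num_quantiles - 1
)

# current approach: quantiles of a standard log-normal, shape s=1
fixed = lognorm.ppf(q, s=1)

# alternative in question: fit the log-normal to the actual durations first
durations = np.random.lognormal(mean=1.0, sigma=0.5, size=1000)  # placeholder data
s, loc, scale = lognorm.fit(durations, floc=0)
fitted = lognorm.ppf(q, s, loc=loc, scale=scale)
```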

cc: @popcornell

Thanks in advance!

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

2 reactions
anautsch commented, May 18, 2022

Hi @bofenghuang, yeah, that warping idea is on me. It resolves time resolution in a latent statistical space rather than requiring it to be defined explicitly for every audio collection one happens to work with. I’m writing the following bulk just to make sure we are on the same page - from your message I assume you know it intuitively already; it’s so others can follow along when reading this once archived.

Please take a look at our tutorial for context. The prequel to your question:

  • long audios are a burden: random batches easily run OOM on limited VRAM
  • reducing the batch size to 1 makes training take forever; it wastes time and power (and still involves lots of padding)
  • so, how do we create batches that need as little padding as possible while maximising training efficiency?
  • the go-to approach was: create buckets defined by duration ranges and treat each range separately - in other words, one bucket takes only a single long audio, others take a few medium audios, and others take all the short ones; tiny audios are slotted in here and there to reduce padding (see the sketch after this list) -> note: VRAM determines the max length of the bucket with the longest audio
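To make the bucketing idea concrete, here is a minimal sketch of duration-bucketed batching, assuming bucket upper boundaries and a max_batch_length (the cap on summed lengths per batch) are already given. The function name and flushing rule are illustrative; this is not the speechbrain implementation.

```python
import numpy as np

def make_batches(durations, boundaries, max_batch_length):
    """Group utterance indices into low-padding batches via duration buckets."""
    buckets = [[] for _ in boundaries]
    batches = []
    for idx, dur in enumerate(durations):
        # place the utterance in the first bucket whose boundary covers it
        b = min(int(np.searchsorted(boundaries, dur)), len(buckets) - 1)
        buckets[b].append(idx)
        # flush once one more max-length item would exceed the VRAM budget
        if (len(buckets[b]) + 1) * boundaries[b] > max_batch_length:
            batches.append(buckets[b])
            buckets[b] = []
    # leftover partially filled buckets become final (smaller) batches
    batches.extend(bucket for bucket in buckets if bucket)
    return batches
```

With this scheme a very long audio flushes its bucket almost immediately (a batch of one), while short audios accumulate into large batches - matching the bullet points above.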

Now, how do we get these buckets with their fancy properties? Without the latent space/warping approach, one would need to define the boundaries by hand: 0 to 0.2; 0.2 to 0.6; 0.6 to 1.2; … or to script some exponential growth. The rationale for exponential growth: most datasets are log-normally distributed when it comes to audio duration. Warping through that distribution projects the buckets into a linear space where they can be treated uniformly. That is why the log-normal distribution is used.
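Roughly, the warping step looks like the following sketch: take evenly spaced points in the latent (linear) space, push them through the quantile function of the assumed log-normal, and rescale so the largest boundary lands on max_batch_length. This mirrors the idea behind _get_boundaries_through_warping but is not the verbatim speechbrain code.

```python
import numpy as np
from scipy.stats import lognorm

def boundaries_through_warping(max_batch_length, num_quantiles):
    # evenly spaced latent points, excluding the degenerate 0 and 1 quantiles
    latent = np.linspace(
        1 / num_quantiles, (num_quantiles - 1) / num_quantiles, num_quantiles - 1
    )
    # warp to duration space through the assumed standard log-normal (s=1)
    quantiles = lognorm.ppf(latent, s=1)
    # linear rescale so the last boundary equals max_batch_length
    return quantiles * max_batch_length / quantiles[-1]
```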

About your question: which log-normal distribution to use is perhaps somewhat arbitrary. 😉

Here, the goal was to get an initial handle on:

  • max_batch_length, which represents the VRAM limit
  • num_quantiles, which represents the targeted resolution in the latent space

For this latent space, your question is about:

  • why use standard parameters for the assumed distribution?
  • why not use a good fit of the actual duration distribution?

The answer might be depressingly simple: to get the PR & tutorial out for later discussions like this one.

My gut feeling is that a distribution fit should play out better than an arbitrarily assumed distribution. Would you be up for diving into tests on this topic? It would/could also make sense to then move away from the log-normal assumption and use a general fit - in the end, what matters here is that quantiles become linear in their handling through warping distributions; see the sketch below. Another question might also be: for the rest of the dynamic batching, why not simply have three or five bucket types and sort the rest in, from long to short audios?
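Taken to its logical end, the fully distribution-free version of this is to use the empirical quantiles of the actual durations instead of any fitted family - a sketch under that assumption:

```python
import numpy as np

def empirical_boundaries(durations, num_quantiles, max_batch_length):
    # evenly spaced latent points, as in the warped log-normal variant
    latent = np.linspace(
        1 / num_quantiles, (num_quantiles - 1) / num_quantiles, num_quantiles - 1
    )
    # the data's own quantiles replace lognorm.ppf entirely
    quantiles = np.quantile(durations, latent)
    return quantiles * max_batch_length / quantiles[-1]
```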

1 reaction
anautsch commented, May 20, 2022

@bofenghuang neat! It’s encouraging to see your enthusiasm 😄 Any choice is arbitrary; basing it on a selection of facts doesn’t make it systematic 😉

@popcornell worked a lot on the final tutorial!

As you demonstrated, num_buckets helps DynamicBatching to do something, but it is far from what one would expect to happen (only 10% of the created buckets are actually used; the rest remain unfilled). Therefore, I’d be curious whether it makes sense to have any distributional assumption at all - or to treat this entirely on the categorical level (have 4, 5, 6, … bucket types to be filled, whatever they are - the limit is VRAM). What I’m trying to say:

  • we need to question what the underlying problem actually is
  • which parameters (among those we know/don’t know) are critical to solving it?
  • the actual problem might be much simpler, but a statistical understanding of its inner workings might help to find a more adequate solution

That’s also what your findings on kmeans support: fitting distributions can help, but in the end we get a dataset, and that is our entire population to take care of during operation. (New dataset, new task - everything back to the start.) Yet, what does the batch creation of DynamicBatching imply for the overall training? How do we test that for one and for many datasets - how do we get the guarantees we need to have?
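For reference, the kmeans variant mentioned here can be prototyped in a few lines - a sketch assuming scikit-learn (which speechbrain does not itself require for this):

```python
import numpy as np
from sklearn.cluster import KMeans

durations = np.random.lognormal(mean=1.0, sigma=0.5, size=1000)  # placeholder data
# cluster in log-duration space, matching the log-normal shape of most datasets
km = KMeans(n_clusters=5, n_init=10).fit(np.log(durations).reshape(-1, 1))
centers = np.sort(np.exp(km.cluster_centers_.ravel()))
# midpoints between adjacent cluster centers can then serve as bucket boundaries
boundaries = (centers[:-1] + centers[1:]) / 2
```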

If the total number of batches created is small and padding is small - how much random permutation of the files within these batches is then possible? Or would there be only one way to draw a batch after kmeans?

What’s your take: is there a benefit in offering many choices of DynamicBatching approach here? Are there “relevant” ones? // The goal is to provide useful tools that make a complex issue intuitive to handle.

% of true samples and % of padding say the same thing - % of padding could be more useful for developing intuition here, but it’s not what one thinks of first (so both were in the tutorial). About the sampler initialization time: it’s relevant to decompose the total time to understand what’s going on internally.
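For clarity, “% of padding” here can be computed as below - a sketch with illustrative names:

```python
def padding_fraction(batches, durations):
    padded = real = 0.0
    for batch in batches:
        lens = [durations[i] for i in batch]
        padded += max(lens) * len(lens)  # frames allocated after padding to the longest
        real += sum(lens)                # frames carrying actual audio
    return 1.0 - real / padded           # share of allocated frames that is padding
```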

DynamicBatching has a few use cases, and whether they are fulfilled needs better testing:

  • low VRAM: create batches that do not run OOM
  • low VRAM: don’t be wasteful - reduce padding
  • low VRAM: see as much data as possible; don’t drop too much because of low VRAM
  • low VRAM: different epochs should feature different batch samplings = random permutations of files
  • low VRAM: while one long audio might fill an entire bucket, two audios of half that length might also run OOM - how do we instantiate buckets properly so that all audios get processed?
  • high VRAM: less padding means fewer batches means faster training

@TParcollet @popcornell please add to that list what I forgot - there might be more requirements for DynamicBatching ^^"
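Some of the requirements above lend themselves to direct checks - a sketch with hypothetical helper names, not an existing speechbrain test:

```python
def check_sampler(batches_epoch1, batches_epoch2, durations, max_batch_length):
    for batch in batches_epoch1:
        # no batch may exceed the VRAM budget (sum of lengths per batch)
        assert sum(durations[i] for i in batch) <= max_batch_length
    # every example is seen exactly once per epoch (nothing silently dropped)
    seen = sorted(i for batch in batches_epoch1 for i in batch)
    assert seen == list(range(len(durations)))
    # different epochs should yield different batch samplings
    assert batches_epoch1 != batches_epoch2
```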

Any investigation into a better DynamicBatching also needs to work across:

  • mini LibriSpeech
  • full LibriSpeech
  • CommonVoice (some language sets there are a few MB; some a few GB; some many GB)

What we observe on small datasets might not hold on large datasets. How about discussing next week a strategy for testing and developing DynamicBatching further?
