
SortedDL for contiguous LM


Hi there, I am currently implementing LM re-training of a RoBERTa model using the Trainer API. Since I have a huge training corpus, I was wondering whether the Trainer or the corresponding DataCollatorForLanguageModeling offers functionality for sorted batching, as in fastai?

More precisely, I would like to feed in all my training data as one contiguous text stream and let the respective functions handle the batching, irrespective of the length of the individual sequences.

Best, Simon

Issue Analytics

Created 3 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments


sgugger commented, Mar 10, 2021

I think you may be referring to the LM DataLoader. On our side, this kind of preprocessing is done with the datasets library. Take a look at the run_clm or run_mlm examples (in run_mlm, the part outside the “line_by_line” block) or the language modeling notebook to see how.
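For readers who want a feel for what that preprocessing does, here is a minimal sketch of the idea: concatenate all tokenized sequences into one stream, then cut the stream into fixed-size blocks. It is modeled on the grouping step in the example scripts, not copied from them; the `block_size` parameter and the toy input are assumptions made for illustration, and the leftover tail shorter than one block is simply dropped.

```python
from itertools import chain


def group_texts(examples, block_size):
    """Concatenate tokenized sequences and split them into equal blocks.

    `examples` maps a feature name (e.g. "input_ids") to a list of
    token lists, which is the shape a batched tokenizer call returns.
    """
    # Flatten every feature into one contiguous stream.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    # All features have the same total length; use the first one.
    total_length = len(next(iter(concatenated.values())))
    # Drop the remainder so every block has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    return {
        k: [stream[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, stream in concatenated.items()
    }


# Toy batch of three "tokenized" sequences of different lengths.
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With the datasets library, a function like this would typically be applied to a tokenized dataset through `Dataset.map` with `batched=True`, so that each call sees many sequences at once and can pack them across sequence boundaries. Because every resulting block has the same length, no sorting by length is needed afterwards.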


simonschoe commented, Mar 26, 2021

This makes complete sense, thanks for lifting this barrier in my head!