
SortedDL for contiguous LM


Hi there, I am currently implementing LM re-training of a RoBERTa model using the Trainer API. Since I have a huge training corpus, I was wondering whether the Trainer or the corresponding DataCollatorForLanguageModeling offers functionality for sorted batching, as in fastai.

More precisely, I would like to feed in all my training data as a contiguous text stream and let the respective functions handle sorted batching irrespective of the sequence length of the individual sequences.

Best, Simon

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Mar 10, 2021

I think you may be referring to the LM DataLoader. This kind of preprocessing is done using the datasets library on our side. Take a look at the run_clm or run_mlm examples (in run_mlm the part that is not in the block “line_by_line”) or the language modeling notebook to see how.
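For readers who want the gist without opening the scripts: the preprocessing referred to here amounts to concatenating the tokenized texts into one stream and cutting it into fixed-size blocks. The following is a minimal sketch of that idea in plain Python, not code copied from run_clm or run_mlm. The function name `group_texts`, the `block_size` parameter, and the toy token ids are illustrative assumptions.

```python
from itertools import chain


def group_texts(examples, block_size):
    """Concatenate tokenized sequences and split them into fixed-size blocks."""
    # Join every column (e.g. "input_ids") into one long contiguous stream.
    concatenated = {
        key: list(chain.from_iterable(seqs)) for key, seqs in examples.items()
    }
    first_key = next(iter(concatenated))
    total_length = len(concatenated[first_key])
    # Drop the trailing remainder so every block holds exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    return {
        key: [stream[i:i + block_size] for i in range(0, total_length, block_size)]
        for key, stream in concatenated.items()
    }


# Toy token ids standing in for tokenizer output (made-up values).
batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}
blocks = group_texts(batch, block_size=4)
print(blocks["input_ids"])
```

With the datasets library, a function of this shape would typically be applied to a tokenized dataset through a batched `map` call, so that sequence boundaries no longer determine the batch layout. Note that this sketch discards the final tokens that do not fill a whole block.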

0 reactions
simonschoe commented, Mar 26, 2021

This makes complete sense, thanks for lifting this barrier in my head!


Top Results From Across the Web

Text data - fastai
The LMDataLoader will concatenate all texts (maybe shuffled) in one big stream, split it in bs contiguous sentences, then go through those...
