SortedDL for contiguous LM
Hi there,
I am currently implementing LM re-training of a RoBERTa model using the Trainer API. Since I have a huge training corpus, I was wondering whether the Trainer or the corresponding DataCollatorForLanguageModeling offers sorted batching as in fastai?
More precisely, I would like to feed in all my training data as one contiguous text stream and let the respective functions handle batching, irrespective of the lengths of the individual sequences.
Best, Simon
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 8 (4 by maintainers)
Top Results From Across the Web
Text data - fastai
The LMDataLoader will concatenate all texts (maybe shuffled) in one big stream, split it in bs contiguous sentences, then go through those...
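A minimal sketch of that splitting scheme in plain Python (not the actual fastai implementation; the function name and parameters here are made up for illustration): concatenate all token streams, cut the result into bs contiguous pieces, then yield fixed-length windows from each piece so that each batch row continues exactly where the previous batch left off.

```python
# Sketch of fastai-style contiguous LM batching (NOT the real LMDataLoader).
# Concatenate all token streams, split into `bs` contiguous pieces,
# then read windows of `seq_len` tokens from each piece in parallel.
# (A real LM loader would also produce targets shifted by one token.)

def contiguous_lm_batches(token_streams, bs, seq_len):
    stream = [tok for text in token_streams for tok in text]  # one big stream
    piece_len = len(stream) // bs                             # tokens per piece
    pieces = [stream[i * piece_len:(i + 1) * piece_len] for i in range(bs)]
    for start in range(0, piece_len - seq_len + 1, seq_len):
        # row j of every batch walks through piece j in order
        yield [p[start:start + seq_len] for p in pieces]

batches = list(contiguous_lm_batches(
    [[1, 2, 3, 4], [5, 6, 7, 8, 9, 10, 11, 12]], bs=2, seq_len=3))
# batches[0] -> [[1, 2, 3], [7, 8, 9]]
# batches[1] -> [[4, 5, 6], [10, 11, 12]]
```

Note how individual text boundaries disappear: the stream is treated as one long document, which is exactly the "contiguous" behavior the question asks about.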
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I think you may be referring to the LM DataLoader. This kind of preprocessing is done with the datasets library on our side. Take a look at the run_clm or run_mlm examples (in run_mlm, the part that is not inside the “line_by_line” block) or the language modeling notebook to see how.

This makes complete sense, thanks for lifting this barrier in my head!