Train on a large text file
Hi, I'm trying to train a model from scratch because I want it to generate text in another language (Swedish).
My training data is a large collection of about 22,000 novels, all in one single .txt file, with each novel delimited by a line containing only <s>.
The txt file is about 300 MB in size.
However, both when I try to train it from scratch using the Colab notebook (with a P100 GPU) and locally on my desktop, it runs out of memory and crashes.
My desktop has 32 GB of RAM and a GeForce 2080 Ti with 11 GB of VRAM.
Is there any way to make aitextgen work with 300 MB of training data?
Are there any parameters I can tweak to make it use less memory?
Should I arrange the training data in another way?
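For reference, here is a minimal from-scratch training sketch with aitextgen that exposes the settings that most affect memory use (block_size, batch_size, and the model dimensions), and caches the encoded corpus so the 300 MB file is only tokenized once. The file name, vocabulary size, and model sizes below are illustrative assumptions, not values taken from this thread:

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import build_gpt2_config

file_name = "novels.txt"  # placeholder for the 300 MB corpus described above

# Train a new BPE tokenizer on the Swedish corpus (vocab_size is illustrative).
# Recent aitextgen versions write aitextgen.tokenizer.json.
train_tokenizer(file_name, vocab_size=10000)
tokenizer_file = "aitextgen.tokenizer.json"

# Encode the corpus once; save_cache writes a compressed dataset cache
# so the text does not have to be re-tokenized on later runs.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file,
                    block_size=256, save_cache=True)

# A small GPT-2 config; shrinking n_embd / n_layer / max_length lowers VRAM use.
config = build_gpt2_config(vocab_size=10000, max_length=256, dropout=0.0,
                           n_embd=256, n_layer=8, n_head=8)

ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# A smaller batch_size trades training speed for less GPU memory.
ai.train(data, batch_size=16, num_steps=50000,
         generate_every=5000, save_every=5000)
```

Lowering block_size, batch_size, or the model dimensions is the usual first step when a run exhausts GPU memory; the dataset cache mainly saves time and RAM on subsequent runs rather than on the first encoding pass.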
@minimaxir, this thread discusses pretty much the same problem I have: a large file, 1.5 GB of Arabic text, that I want to train on.
Is there a way this library could handle a file of this size? For example, training on one batch at a time, or splitting the file into chunks and feeding them to the trainer gradually?
Update: Got it working by setting batch_size=128, num_workers=2, and running:
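The command itself was not captured above. Purely as an illustration of where those two settings plug in, a from-scratch run with them might look like the following; only batch_size=128 and num_workers=2 come from the comment, while the file name, tokenizer file, and model dimensions are placeholders:

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.utils import build_gpt2_config

# Placeholder setup; only batch_size=128 and num_workers=2 are from the comment.
tokenizer_file = "aitextgen.tokenizer.json"
config = build_gpt2_config(vocab_size=10000, max_length=64, dropout=0.0,
                           n_embd=256, n_layer=8, n_head=8)
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

data = TokenDataset("arabic_corpus.txt", tokenizer_file=tokenizer_file,
                    block_size=64)

# num_workers controls how many DataLoader worker processes feed batches.
ai.train(data, batch_size=128, num_workers=2, num_steps=50000)
```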