
Train on large textfile

See original GitHub issue

Hi, I’m trying to train a model from scratch because I want it to generate text in another language (Swedish). My training data is a large collection of novels, about 22,000 of them, all in a single .txt file, with each novel delimited by a line containing only <s>. The file is about 300 MB. However, when I try to train from scratch, both in the Colab notebook (with a P100 GPU) and locally on my desktop, it runs out of memory and crashes. My desktop has 32 GB of RAM and a GeForce 2080 Ti with 11 GB of VRAM.

Is there any way to make aitextgen work with 300 MB of training data? Are there any parameters I can tweak so it uses less memory? Should I arrange the training data differently?
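One general workaround for files that don’t fit in RAM, independent of aitextgen’s own API, is to stream the text lazily rather than reading the whole file at once. The sketch below is a hypothetical, plain-Python illustration (not aitextgen code) of a generator that yields one <s>-delimited novel at a time, so only a single document is ever held in memory:

```python
def iter_documents(path, delimiter="<s>"):
    """Yield one document at a time from a large text file whose
    documents are separated by lines containing only `delimiter`.
    Only the current document's lines are kept in memory."""
    buffer = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip() == delimiter:
                if buffer:
                    yield "".join(buffer)
                    buffer = []
            else:
                buffer.append(line)
    # Emit the final document if the file doesn't end with a delimiter
    if buffer:
        yield "".join(buffer)
```

A generator like this could then back a PyTorch `IterableDataset` so the DataLoader pulls documents on demand instead of materializing the full 300 MB corpus up front.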

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

1 reaction
mohataher commented, Jun 12, 2020

@minimaxir, this thread discusses pretty much the same thing. I have a large file, 1.5 GB of Arabic text specifically, that I want to train on.

Is there a way this library could handle a file of this size? For example, training on one batch at a time, or splitting the file into chunks and feeding them to the trainer gradually?
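The chunking idea above can be done ahead of time with a short preprocessing script. Below is a hypothetical sketch (stdlib only; the function name, chunk size, and file layout are made up for illustration) that splits a large file into roughly fixed-size chunk files, cutting only at the <s> delimiter lines so no document is split in half:

```python
import os

def split_on_delimiter(src_path, out_dir, max_bytes=50 * 1024 * 1024,
                       delimiter="<s>"):
    """Split a large delimiter-separated text file into chunk files of
    roughly `max_bytes` each, starting a new chunk only at a document
    boundary. Returns the number of chunks written."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_idx, written = 0, 0
    out = open(os.path.join(out_dir, f"chunk_{chunk_idx:04d}.txt"),
               "w", encoding="utf-8")
    with open(src_path, encoding="utf-8") as f:
        for line in f:
            out.write(line)
            written += len(line.encode("utf-8"))
            # Roll over to a new chunk only right after a delimiter line
            if line.strip() == delimiter and written >= max_bytes:
                out.close()
                chunk_idx += 1
                written = 0
                out = open(os.path.join(out_dir, f"chunk_{chunk_idx:04d}.txt"),
                           "w", encoding="utf-8")
    out.close()
    return chunk_idx + 1
```

The resulting chunk files could then be fed to the trainer one at a time, which trades a longer preprocessing step for a bounded memory footprint per training run.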

1 reaction
zephyo commented, May 31, 2020

Update: got it working by setting batch_size=128 and num_workers=2, and running:

```python
import torch

# Release unused cached GPU memory held by PyTorch's caching allocator
torch.cuda.empty_cache()
```

Read more comments on GitHub >

Top Results From Across the Web

  • How to use Pytorch Dataloaders to work with enormously large text files
  • train Gensim word2vec using large txt file - Stack Overflow
  • Load text | TensorFlow Core
  • DataLab at Tufts | Text Data
  • Large Text File Import - MATLAB Answers - MathWorks
