Train on a large text file
Hi, I'm trying to train a model from scratch because I want it to generate text in another language (Swedish).
My training data is a large collection of about 22,000 novels, all in one single .txt file, with each novel delimited by a line containing only <s>.
The txt file is about 300 MB in size.
However, both when I try to train it from scratch using the Colab notebook (with a P100 GPU) and locally on my desktop, it runs out of memory and crashes.
My desktop has 32 GB of RAM and a GeForce 2080 Ti with 11 GB of VRAM.
Is there any way to make aitextgen work with 300 MB of training data?
Are there any parameters I can tweak to make it use less memory?
Should I arrange the training data in another way?
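For reference, here is a minimal from-scratch training sketch with aitextgen that exposes the settings that most affect memory use (block_size, batch_size, and the model dimensions), and caches the encoded corpus so the 300 MB file is only tokenized once. The file name, vocabulary size, and model sizes below are illustrative assumptions, not values taken from this thread:

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
from aitextgen.utils import build_gpt2_config

file_name = "novels.txt"  # placeholder for the 300 MB corpus described above

# Train a new BPE tokenizer on the Swedish corpus (vocab_size is illustrative).
# Recent aitextgen versions write aitextgen.tokenizer.json.
train_tokenizer(file_name, vocab_size=10000)
tokenizer_file = "aitextgen.tokenizer.json"

# Encode the corpus once; save_cache writes a compressed dataset cache
# so the text does not have to be re-tokenized on later runs.
data = TokenDataset(file_name, tokenizer_file=tokenizer_file,
                    block_size=256, save_cache=True)

# A small GPT-2 config; shrinking n_embd / n_layer / max_length lowers VRAM use.
config = build_gpt2_config(vocab_size=10000, max_length=256, dropout=0.0,
                           n_embd=256, n_layer=8, n_head=8)

ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

# A smaller batch_size trades training speed for less GPU memory.
ai.train(data, batch_size=16, num_steps=50000,
         generate_every=5000, save_every=5000)
```

Lowering block_size, batch_size, or the model dimensions is the usual first step when a run exhausts GPU memory; the dataset cache mainly saves time and RAM on subsequent runs rather than on the first encoding pass.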
@minimaxir, this thread discusses pretty much the same problem I have: a large file, 1.5 GB of Arabic text, that I want to train on.
Is there a way this library could handle a file of this size? For example, training on one batch at a time, or splitting the file into chunks and feeding them to the trainer gradually?
Update: Got it working by setting batch_size=128, num_workers=2, and running:
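The command itself was not captured above. Purely as an illustration of where those two settings plug in, a from-scratch run with them might look like the following; only batch_size=128 and num_workers=2 come from the comment, while the file name, tokenizer file, and model dimensions are placeholders:

```python
from aitextgen import aitextgen
from aitextgen.TokenDataset import TokenDataset
from aitextgen.utils import build_gpt2_config

# Placeholder setup; only batch_size=128 and num_workers=2 are from the comment.
tokenizer_file = "aitextgen.tokenizer.json"
config = build_gpt2_config(vocab_size=10000, max_length=64, dropout=0.0,
                           n_embd=256, n_layer=8, n_head=8)
ai = aitextgen(tokenizer_file=tokenizer_file, config=config)

data = TokenDataset("arabic_corpus.txt", tokenizer_file=tokenizer_file,
                    block_size=64)

# num_workers controls how many DataLoader worker processes feed batches.
ai.train(data, batch_size=128, num_workers=2, num_steps=50000)
```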