
Huge amount of CPU RAM needed during training

See original GitHub issue

Hello team!

With the current version of fairseq we noticed that a huge amount of RAM (CPU RAM, not GPU RAM) is required to run training. Moreover, this is correlated with the number of GPUs used on the same machine.

So my guess is that the binarized training data is loaded entirely into RAM by every GPU process, which means the total CPU RAM usage is roughly RAM ~= (number of GPUs) * sizeof(binarized data). If this is true, the amount of RAM needed for medium/large training sets is huge (hundreds of GB) compared to the size of the training set itself (less than 100 GB).

If this is the case, why can’t we use a memory-mapped training set, so that the amount of RAM depends exclusively on sizeof(binarized data)?
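As a rough illustration of the difference (a sketch, not fairseq’s actual data loader; the file name, dtype, and flat layout are assumptions), a memory-mapped dataset is served from the shared OS page cache, so each additional GPU process does not add another private copy:

```python
# Rough sketch of the memory-mapping idea (not fairseq's actual loader).
# The file name, dtype, and flat layout are illustrative assumptions.
import numpy as np
import torch

DATA_PATH = "train.bin"   # hypothetical binarized token file
DTYPE = np.int64

def load_cached(path):
    # Eager load: every GPU worker process pays sizeof(data) in private RAM.
    return np.fromfile(path, dtype=DTYPE)

def load_mmap(path):
    # Memory map: pages live in the shared OS page cache, so the cost is
    # paid roughly once per machine, not once per process.
    return np.memmap(path, dtype=DTYPE, mode="r")

data = load_mmap(DATA_PATH)
# Only the pages backing this slice are faulted in; the copy makes the
# resulting tensor small, writable, and independent of the mapping.
sample = torch.from_numpy(np.array(data[1000:1050]))
print(sample.shape)
```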

I’m available to work on this if needed; could you please give me some code context or a good starting point to begin with?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 22 (22 by maintainers)

Top GitHub Comments

1 reaction
frankang commented, Dec 28, 2020

Just to provide another workaround for the RAM issue. I wrapped the mmap dataset with a customized dataset class and got the following error:

RuntimeError: DataLoader worker (pid xxx) is killed by signal: Segmentation fault.
...
_pickle.UnpicklingError: pickle data was truncated

It is probably caused by some unpicklable data used in my customized dataset class, but I don’t have enough time to look into it, so I tried the dataset-impl=lazy option (with the data placed on a SATA SSD). It worked, but the speed decreased by 30%, and I could clearly see from nvidia-smi that the GPUs were not fully saturated. Then I created a ramfs mount following https://unix.stackexchange.com/questions/66329/creating-a-ram-disk-on-linux, and voilà! It runs smoothly. The speed only decreased by about 7%, which is fairly acceptable to me.
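One common cause of this kind of DataLoader worker failure is a custom Dataset that holds an open mmap or file handle, which does not pickle cleanly into worker processes. The sketch below shows the usual lazy-open workaround; the path and record layout are assumptions, not the actual wrapper used here:

```python
# Hedged sketch of the lazy-open pattern: keep only picklable state on the
# Dataset and open the memmap inside each worker process on first access.
import os
import numpy as np
from torch.utils.data import Dataset, DataLoader

class MmapWrapperDataset(Dataset):
    def __init__(self, path, record_len, dtype=np.int64):
        self.path = path              # keep only picklable state here
        self.record_len = record_len
        self.dtype = np.dtype(dtype)
        self._data = None             # opened lazily inside each worker

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_data"] = None         # never ship an open memmap to a worker
        return state

    def _lazy_open(self):
        if self._data is None:
            self._data = np.memmap(self.path, dtype=self.dtype, mode="r")
        return self._data

    def __len__(self):
        # Derive the length from the file size so the main process never
        # has to open the mapping before workers are spawned.
        return os.path.getsize(self.path) // self.dtype.itemsize // self.record_len

    def __getitem__(self, idx):
        data = self._lazy_open()
        start = idx * self.record_len
        return np.array(data[start:start + self.record_len])

# Hypothetical usage with fixed-length records of 128 tokens:
# loader = DataLoader(MmapWrapperDataset("train.bin", 128), num_workers=4)
```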

1 reaction
davidecaroselli commented, May 5, 2019

Hi @myleott, yes! I spotted the problem with my previous implementation: I was creating all the tensors at startup, so the problem was the overhead; even with mmap data, the individual tensors required too much memory.

This new version creates tensors lazily over a single mmapped memoryview. I was initially worried about the time overhead, but surprisingly I measure exactly the same wps (words per second) as the regular cached version, which is great!
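A minimal sketch of what lazy tensor creation over a single mmapped buffer can look like (the index layout and dtype are assumptions, not fairseq’s on-disk format):

```python
# Illustrative sketch: one shared memory-mapped buffer, with per-example
# tensors built on demand from (offset, length) pairs.
import numpy as np
import torch

class LazyMmapDataset:
    def __init__(self, bin_path, sizes, dtype=np.int64):
        # Single mapping shared by all examples; nothing is read eagerly.
        self._buffer = np.memmap(bin_path, dtype=dtype, mode="r")
        self._sizes = np.asarray(sizes, dtype=np.int64)
        self._offsets = np.concatenate(([0], np.cumsum(self._sizes[:-1])))

    def __len__(self):
        return len(self._sizes)

    def __getitem__(self, i):
        start, length = self._offsets[i], self._sizes[i]
        # Only the pages backing this slice are faulted into the page cache;
        # the copy keeps the returned tensor small and writable.
        return torch.from_numpy(np.array(self._buffer[start:start + length]))
```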

I have also run some measurements of RAM usage; here are my results:

MODEL SIZE TEST

Here I have used a tiny training set, so the RAM usage is due only to the network model itself; this is basically the base RAM consumption, independent of training set size.

| 1 GPU   | 2 GPUs  | 4 GPUs  | 8 GPUs   |
|---------|---------|---------|----------|
| 1931 MB | 3884 MB | 7646 MB | 15469 MB |

So we can say that, for a base transformer model, fairseq requires ~1920 MB of CPU RAM per GPU.

CACHED DATASET

This is the same base transformer model training, but with a 12.6M-line cached dataset.

| 1 GPU   | 2 GPUs  | 4 GPUs   | 8 GPUs   |
|---------|---------|----------|----------|
| 6093 MB | 9243 MB | 15321 MB | 27472 MB |

By removing the model overhead:

| 1 GPU   | 2 GPUs  | 4 GPUs  | 8 GPUs   |
|---------|---------|---------|----------|
| 4162 MB | 5359 MB | 7675 MB | 12003 MB |

This is the in-RAM size of the dataset. Here I actually see something I did not expect: the memory consumption grows less than linearly with the number of GPUs. Since every GPU process reloads the dataset entirely, I expected a linear dependency. Maybe there are some problems in the measurements?
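One possible explanation, and a way to check it, is that plain RSS counts pages shared between forked workers (copy-on-write memory and the page cache) once per process, while PSS/USS split shared from private memory. The helper below is an assumption about how such numbers could be cross-checked, not how the figures above were collected:

```python
# Compare RSS with PSS/USS per training process (psutil; PSS is Linux-only).
import psutil

def report(pid):
    p = psutil.Process(pid)
    info = p.memory_full_info()        # needs same user or root
    pss = getattr(info, "pss", 0)      # PSS is only available on Linux
    print(f"pid={pid} rss={info.rss >> 20} MB "
          f"pss={pss >> 20} MB uss={info.uss >> 20} MB")

# Hypothetical filter: report on every running training worker.
for proc in psutil.process_iter(["cmdline"]):
    cmdline = proc.info["cmdline"] or []
    if any("train.py" in part for part in cmdline):
        try:
            report(proc.pid)
        except psutil.AccessDenied:
            pass
```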

MMAP DATASET

This is the same base transformer model training, but with a 12.6M-line memory-mapped dataset.

| 1 GPU   | 2 GPUs  | 4 GPUs  | 8 GPUs   |
|---------|---------|---------|----------|
| 2424 MB | 4867 MB | 9591 MB | 19052 MB |

By removing the model overhead (we have 2.9 GB of buff/cache for the memory-mapped dataset):

| 1 GPU  | 2 GPUs | 4 GPUs  | 8 GPUs  |
|--------|--------|---------|---------|
| 493 MB | 983 MB | 1945 MB | 3583 MB |

So here we see some savings. I’m not sure why I still have ~493 MB per GPU (the whole dataset is memory-mapped, so it should not appear in resident memory); I think this is still some model-dependent data structure.

Read more comments on GitHub >
