Will the GPT2 775M model finetune on 16GB VRAM and 24GB RAM? (answer is 'yes'; now what about 1.3B?)
I'm using deepspeed 0.4.5+c543a41 (I tried 0.4.4 as well) and transformers-4.9.1, and I attempted earlier versions of everything too.
I was able to finetune GPT2 355M with sequence length 2048 and batch size 1 on this configuration (it is Google Colab); everything fit in VRAM without DeepSpeed. (Strangely, I also tried FP16 here (OM 1) to see its effect, and it did not change much: about 1024MB of VRAM was left free with FP16 and about 625MB without.)
But when I try the 775M model with sequence length 2048, batch size 1, FP16, and CPU offloading, I run out of CPU memory on the first training step. I tried stage 2 zero_optimization with "allgather_bucket_size": 2e8 and "reduce_bucket_size": 2e8, but that leaves 9GB of VRAM free at the moment the process gets killed by Linux for eating all the RAM (I confirmed in htop that the RAM is gone), so that VRAM is wasted. I tried setting both bucket sizes to 2e9; that way only ~300MB of VRAM is left free, but it didn't help with the RAM.
Here are the settings:
"zero_optimization": {
"stage": 2,
"cpu_offload": true,
"allgather_partitions": true,
"allgather_bucket_size": 2e9,
"reduce_scatter": true,
"reduce_bucket_size": 2e9,
"overlap_comm": true,
"contiguous_gradients": true
},
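As a rough sanity check on whether the model states alone can fit, DeepSpeed ships memory estimators that can be run before training. Below is a minimal sketch, assuming the 774M checkpoint is available on the Hub as gpt2-large; the import path has moved between DeepSpeed releases (older versions expose the estimator from deepspeed.runtime.zero.stage2), and the estimate covers only parameters, gradients, and optimizer states, not the activations of a 2048-token sequence.

# Minimal sketch (not the exact script from this issue): ask DeepSpeed how much
# CPU and GPU memory the ZeRO-2 model states need for a ~774M-parameter GPT-2.
# The import path differs across DeepSpeed releases; older versions expose the
# estimator from deepspeed.runtime.zero.stage2 instead of stage_1_and_2.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage_1_and_2 import (
    estimate_zero2_model_states_mem_needs_all_live,
)

model = AutoModelForCausalLM.from_pretrained("gpt2-large")  # ~774M parameters
estimate_zero2_model_states_mem_needs_all_live(model,
                                               num_gpus_per_node=1,
                                               num_nodes=1)

The printout lists estimates for the different offload options, which makes it easier to tell whether the RAM exhaustion comes from the optimizer states themselves or from something else (activations, buckets, pinned buffers).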
I also tried to use --sparse-mode alternating with the config:
"sparse_attention": {
"mode": "fixed",
"block": 16,
"different_layout_per_head": true,
"num_local_blocks": 8,
"num_global_blocks": 1,
"attention": "unidirectional",
"horizontal_global_attention": false,
"num_different_global_patterns": 8
}
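For reference, the same fixed pattern can also be expressed through DeepSpeed's programmatic sparse-attention classes; the sketch below mirrors the JSON keys one-to-one, and num_heads=20 is my assumption for the 774M GPT-2 (actually wiring this into the attention layers via SparseSelfAttention additionally requires Triton).

# Sketch only: the "fixed" sparsity pattern above, expressed with DeepSpeed's
# sparse-attention config class. Argument names mirror the JSON keys;
# num_heads=20 is an assumption (the 774M GPT-2 uses 20 attention heads).
from deepspeed.ops.sparse_attention import FixedSparsityConfig

sparsity_config = FixedSparsityConfig(
    num_heads=20,
    block=16,
    different_layout_per_head=True,
    num_local_blocks=8,
    num_global_blocks=1,
    attention="unidirectional",
    horizontal_global_attention=False,
    num_different_global_patterns=8,
)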
But with sparse attention enabled, the only change is that the FP16 loss scale autoscales down to 32768 instead of 1048576, and the RAM is depleted just the same.
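For context on those two numbers: they come from DeepSpeed's dynamic FP16 loss scaler, where 1048576 is 2**20 and 32768 is 2**15, and the behaviour is controlled by the fp16 section of the DeepSpeed config. A sketch of that section is below, written as a Python dict; the values are illustrative, not the exact settings from this run.

# Sketch of the "fp16" section of a DeepSpeed config (illustrative values).
# "loss_scale": 0 selects dynamic loss scaling; initial_scale_power 20 gives
# the 2**20 = 1048576 starting scale mentioned above, and 2**15 = 32768 is the
# value it backed off to.
fp16_section = {
    "enabled": True,
    "loss_scale": 0,             # 0 = dynamic loss scaling
    "initial_scale_power": 20,   # initial scale = 2**20
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1,
}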
I feel that if fine-tuning the 355M model fits in 15GB of VRAM, there should be a way to fit 775M in 16GB VRAM + 24GB RAM. What should I try?
Top GitHub Comments
@tjruwase thank you for your reply, and sorry for the delay; in fact, I only found out about your reply yesterday, because my mail client suddenly tagged everything from GitHub as spam. I am still thinking through my update on this issue and hope to post it in a couple of days.
@tjruwase Hi! Thank you for your efforts. I have pretty much given up on this for now. I believe my last reports still hold true; that is all I know. Do you think we should close it?