BART on CNN/DM: how to train on a small GPU?
I’m trying to reproduce the CNN/DM results of BART. Unfortunately, I don’t have access to good GPUs: I only have access to 2 GPUs with 8GB of memory each.
I updated the fine-tuning command accordingly (changing `UPDATE_FREQ` for the number of GPUs).
But I have an issue with GPU memory: I tried reducing `MAX_TOKENS` to `512` in order to make the data fit in my 8GB, but I get the following error:
`AssertionError: sentence at index 227550 of size 728 exceeds max_tokens limit of 512!`
If I set `MAX_TOKENS` to `1024`, I get a CUDA out of memory error (expected).
What modifications do I need to make to fine-tune the model on small GPUs (8GB)?
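For reference, here is a minimal sketch of where those knobs live in the fine-tuning command, adapted to 2 GPUs. The flag set is assumed from fairseq’s `examples/bart` summarization recipe (it may differ across fairseq versions), and the `MAX_TOKENS` / `UPDATE_FREQ` values are illustrative, not a tested configuration. The effective batch size scales with `n_gpus × MAX_TOKENS × UPDATE_FREQ`, so fewer GPUs or fewer tokens per batch means a proportionally larger `UPDATE_FREQ`:

```bash
# Sketch: fairseq BART fine-tuning on CNN/DM, adapted for 2 GPUs.
# "cnn_dm-bin" is the binarized data dir from the recipe; all values are
# assumptions based on the stock recipe, not the poster's exact command.
BART_PATH=/path/to/bart.large/model.pt

COMMON_FLAGS=(
    --restore-file "$BART_PATH"
    --task translation --source-lang source --target-lang target
    --arch bart_large --layernorm-embedding
    --share-all-embeddings --share-decoder-input-output-embed
    --truncate-source                       # truncate over-long source articles
    --reset-optimizer --reset-dataloader --reset-meters
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1
    --optimizer adam --adam-betas "(0.9, 0.999)" --adam-eps 1e-08
    --weight-decay 0.01 --clip-norm 0.1
    --lr-scheduler polynomial_decay --lr 3e-05
    --total-num-update 20000 --warmup-updates 500
    --required-batch-size-multiple 1
    --skip-invalid-size-inputs-valid-test   # drop examples that exceed --max-tokens
    --find-unused-parameters
)

# MAX_TOKENS=1024 still OOMs on 8GB cards without further tricks (see the
# comments below); UPDATE_FREQ is raised here (an assumed value) so that
# 2 GPUs accumulate roughly the same effective batch as the stock recipe.
CUDA_VISIBLE_DEVICES=0,1 fairseq-train cnn_dm-bin "${COMMON_FLAGS[@]}" \
    --max-tokens 1024 --update-freq 16 --fp16
```

The `--skip-invalid-size-inputs-valid-test` flag should make fairseq drop examples that exceed `--max-tokens` instead of raising the assertion shown above; whether that is acceptable depends on how many training examples you are willing to lose.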
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@wonjininfo
On my side, I trained BART on 4 x 11GB GPUs.
As mentioned earlier, 11GB is not enough to fit 1 sample (1024 tokens), so I used `--memory-efficient-fp16`. Even though my GPU does not support FP16 training, this reduced the required memory by almost half. But it was still not enough, so I reduced `MAX_TOKENS` from `1024` to `928`. With these parameters, I could fit 1 sample on my GPU.
With `MAX_TOKENS = 928` and `--memory-efficient-fp16`, the results I got are a bit lower than normal BART, but that was expected given my parameters.
I didn’t try training the model with a lower `MAX_TOKENS`, as I could already fit 1 sample with `928`.
Merry Christmas 😃
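For concreteness, those deltas map onto the sketch from the question roughly like this. The `UPDATE_FREQ` value is an assumption (the comment does not give one), and `--memory-efficient-fp16` is passed in place of plain `--fp16`:

```bash
# The commenter's memory-saving deltas, reusing COMMON_FLAGS from the
# sketch above (the UPDATE_FREQ value is an assumption, not from the thread).
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train cnn_dm-bin "${COMMON_FLAGS[@]}" \
    --max-tokens 928 \
    --update-freq 8 \
    --memory-efficient-fp16  # replaces --fp16: fp16 weights and fp16 optimizer state
```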
Note that `--memory-efficient-fp16` can produce worse results, especially with small batch sizes. You’re probably better off decreasing the batch size and/or training in FP32, since FP16 can actually use more memory: it needs to maintain both an FP32 and an FP16 copy of the model.
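A rough back-of-the-envelope makes that trade-off concrete. The per-parameter layouts below are assumptions about how fairseq’s FP16 optimizers typically store state (plain `--fp16` keeping an FP32 master copy of the weights plus FP32 Adam state, the memory-efficient variant keeping both in FP16); gradients and activations come on top of these figures:

```bash
# Approximate parameter memory for BART-large (~406M parameters).
# Bytes-per-parameter layouts are assumptions, not measurements from this thread.
params=406000000
gib() { awk -v n="$1" 'BEGIN { printf "%.1f GiB\n", n / 1024^3 }'; }
echo "fp32   (4B weights + 8B Adam state):                 $(gib $((params * 12)))"
echo "--fp16 (2B weights + 4B fp32 master + 8B fp32 Adam): $(gib $((params * 14)))"
echo "--memory-efficient-fp16 (2B weights + 4B fp16 Adam): $(gib $((params * 6)))"
```

By this arithmetic, plain FP16 costs more parameter memory than FP32, which matches the caveat above; the savings of `--memory-efficient-fp16` come from dropping the FP32 copies, at the price of less stable updates.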