
Loading Big Model exceeds max_memory

System Info

- `Accelerate` version: 0.11.0.dev0
- Platform: Linux-5.4.188+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.13
- Numpy version: 1.21.6
- PyTorch version (GPU?): 1.11.0+cu113 (True)
- `Accelerate` default config:
	Not found

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

https://colab.research.google.com/drive/1lh9rduNcnGNPHgqWfgTmRK_5k51gF75q?usp=sharing

Expected behavior

When using the "max_memory" parameter, the script should use no more than the specified amount of memory. However, it exceeds the max_memory limit, consumes all available memory, and crashes the Colab runtime.

Any idea why it doesn't respect the parameter?
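
For context, max_memory is the per-device cap dict that Accelerate's big-model utilities accept. The exact call from the Colab notebook isn't reproduced here; the sketch below shows the usual pattern, with the model name taken from the comment further down and memory limits that are purely illustrative:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# Build the model skeleton without allocating real weights.
config = AutoConfig.from_pretrained("bigscience/bloom-6b3")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Ask Accelerate for a placement that respects these per-device caps.
# Keys are GPU indices plus "cpu"; the sizes here are illustrative only.
device_map = infer_auto_device_map(model, max_memory={0: "10GiB", "cpu": "20GiB"})
print(device_map)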

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (3 by maintainers)

Top GitHub Comments

1 reaction
xnohat commented, Jul 14, 2022

I successfully loaded a big model on Colab Pro by following these guides from Hugging Face:

https://huggingface.co/docs/transformers/big_models
https://huggingface.co/docs/accelerate/big_modeling

import os
import tempfile

import torch
from transformers import AutoModelForCausalLM

# Load the model in fp16 to halve the RAM footprint of the weights.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-6b3", torch_dtype=torch.float16,
    use_cache=True, low_cpu_mem_usage=True)

# Folder where weights that don't fit in RAM/VRAM will be offloaded.
offload_dir = '/content/offload'
os.makedirs(offload_dir, exist_ok=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Re-save the checkpoint in small shards so it can be reloaded shard by shard.
    model.save_pretrained(tmp_dir, max_shard_size="200MB")
    print('Temp Dir Path:', tmp_dir)
    print(sorted(os.listdir(tmp_dir)))
    new_model = AutoModelForCausalLM.from_pretrained(
        tmp_dir, low_cpu_mem_usage=True,
        device_map="auto", offload_folder=offload_dir)
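
A note on why the small max_shard_size likely helps: per the linked big_models guide, a checkpoint saved in shards and reloaded with low_cpu_mem_usage=True is read one shard at a time, so peak RAM stays close to the final model footprint plus a single ~200MB shard, rather than the full model plus the entire original checkpoint.
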
0 reactions
github-actions[bot] commented, Aug 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

