
Loading mt5-xxl raises an error related to PyTorch/TF incompatibility


Environment info

  • transformers version: 4.10.0
  • Platform: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.1
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

Model hub: @patrickvonplaten

Information

Model I am using: mt5-xxl

The problem arises when using:

  • the official example scripts: examples/pytorch/translation/run_translation.py

The task I am working on is:

  • an official GLUE/SQuAD task: the example translation task given in the documentation

To reproduce

Steps to reproduce the behavior.

I am using deepspeed to load the model into memory.

deepspeed transformers/examples/pytorch/translation/run_translation.py \
  --do_train \
  --model_name_or_path google/mt5-xxl \
  --source_lang en --target_lang ro \
  --dataset_name wmt16 --dataset_config_name ro-en \
  --output_dir exp/33-a \
  --per_device_train_batch_size=4 --per_device_eval_batch_size=4 \
  --overwrite_output_dir --deepspeed ds_config.json

ds_config.json

{ 
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": 1e9
  }
}

Expected behavior

The model should be trained, but I get the following error:

OSError: Unable to load weights from pytorch checkpoint file for 'google/mt5-xxl' at '/net/people/plgapohl/.cache/huggingface/transformers/b36655ddd18a5fda6384477b693eb51ddc8d5bfd2e9a91ed202317f660041716.c24311abc84f0f3a6095195722be4735840971f245dfb6ea3a407c9bed537390'
If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

~Changing line 342 in run_translation.py to pass from_tf=True fixes the problem.~

~I believe the model on the hub named pytorch_model.bin is in fact a TF model.~
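One hedged way to sanity-check the (later retracted) suspicion struck through above: PyTorch 1.6+ saves checkpoints as zip archives by default, so a genuine modern pytorch_model.bin should pass a zip magic-number check. The helper name below is my own, not part of transformers:

```python
# Sketch: distinguish a modern PyTorch checkpoint from some other file.
# torch.save (PyTorch >= 1.6, default settings) writes a zip archive, so
# the file should pass zipfile's magic-number check. Older checkpoints
# are plain pickles and will fail this check too, so a False result is
# only a hint, not proof of a TF checkpoint.
import zipfile

def looks_like_torch_zip_checkpoint(path: str) -> bool:
    """Return True if `path` is a zip archive (PyTorch >= 1.6 save format)."""
    return zipfile.is_zipfile(path)
```

Note that in this issue the file was a valid PyTorch checkpoint all along; the misleading hint in the OSError came from the loading failure itself.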

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
apohllo commented, Sep 20, 2021

Yes, I can confirm this is a CPU-RAM-related issue. I managed to load t5-xxl on a single V100 GPU (32 GB VRAM) with 70 GB of CPU RAM, and the same model on two V100 GPUs (32 GB VRAM each) with more than 200 GB of CPU RAM (according to the command that tracks maximum memory consumption). In that scenario I switched to ZeRO 2 instead of ZeRO 3.
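A minimal sketch of the ZeRO 2 switch mentioned above, assuming only the stage number changes relative to the ds_config.json in the report (the stage3_max_live_parameters key is ZeRO-3-specific and is dropped; other fields keep DeepSpeed's defaults):

```json
{
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 2
  }
}
```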

We can close the issue. Maybe tagging it with model parallel and deepspeed would improve its discoverability?

1 reaction
patrickvonplaten commented, Sep 19, 2021

Hey @apohllo,

I think the problem is that you don’t have enough CPU RAM to instantiate the model. mt5-xxl requires around 95 GB of RAM to be loaded into memory currently. We are working on an improved loading implementation for large files though that should reduce this number to something much closer to the model file size (48GB) - see: https://github.com/huggingface/transformers/issues/13548
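The ~95 GB figure is roughly consistent with a peak of about twice the checkpoint size — my assumption about the mechanism, not stated in the thread: the model is first materialised with random weights and the full deserialised state dict is then held in memory alongside it.

```python
# Back-of-the-envelope check of the RAM numbers quoted above.
# Assumption: peak CPU RAM ~= 2x checkpoint size (randomly initialised
# model plus the deserialised state dict resident at the same time).
CHECKPOINT_GB = 48          # mt5-xxl pytorch_model.bin size, per the comment
peak_gb = 2 * CHECKPOINT_GB
print(peak_gb)              # 96, close to the quoted ~95 GB
```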
