
Loading mt5-xxl raises an error related to PyTorch/TF incompatibility


Environment info

  • transformers version: 4.10.0
  • Platform: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
  • Python version: 3.9.1
  • PyTorch version (GPU?): 1.9.0+cu102 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Who can help

Model hub: @patrickvonplaten

Information

Model I am using: mt5-xxl

The problem arises when using:

  • the official example scripts: examples/pytorch/translation/run_translation.py

The task I am working on is:

  • an official GLUE/SQuAD task: the example translation task given in the documentation

To reproduce

Steps to reproduce the behavior.

I am using deepspeed to load the model into memory.

deepspeed transformers/examples/pytorch/translation/run_translation.py \
  --do_train \
  --model_name_or_path google/mt5-xxl \
  --source_lang en --target_lang ro \
  --dataset_name wmt16 --dataset_config_name ro-en \
  --output_dir exp/33-a \
  --per_device_train_batch_size=4 --per_device_eval_batch_size=4 \
  --overwrite_output_dir --deepspeed ds_config.json

ds_config.json

{ 
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": 1e9
  }
}

Expected behavior

The model should be trained, but I get the following error:

OSError: Unable to load weights from pytorch checkpoint file for 'google/mt5-xxl' at '/net/people/plgapohl/.cache/huggingface/transformers/b36655ddd18a5fda6384477b693eb51ddc8d5bfd2e9a91ed202317f660041716.c24311abc84f0f3a6095195722be4735840971f245dfb6ea3a407c9bed537390'
If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

~Changing line 342 in run_translation.py to pass from_tf=True fixes the problem.~

~I believe the model on the hub named pytorch_model.bin is in fact a TF model.~
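One hedged way to sanity-check the (later retracted) suspicion struck through above: PyTorch 1.6+ saves checkpoints as zip archives by default, so a genuine modern pytorch_model.bin should pass a zip magic-number check. The helper name below is my own, not part of transformers:

```python
# Sketch: distinguish a modern PyTorch checkpoint from some other file.
# torch.save (PyTorch >= 1.6, default settings) writes a zip archive, so
# the file should pass zipfile's magic-number check. Older checkpoints
# are plain pickles and will fail this check too, so a False result is
# only a hint, not proof of a TF checkpoint.
import zipfile

def looks_like_torch_zip_checkpoint(path: str) -> bool:
    """Return True if `path` is a zip archive (PyTorch >= 1.6 save format)."""
    return zipfile.is_zipfile(path)
```

Note that in this issue the file was a valid PyTorch checkpoint all along; the misleading hint in the OSError came from the loading failure itself.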

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

1 reaction
apohllo commented, Sep 20, 2021

Yes, I can confirm this is a CPU-RAM-related issue. I managed to load t5-xxl on a single V100 GPU (32 GB VRAM) with 70 GB of CPU RAM, and the same model on two V100 GPUs (32 GB VRAM each) with more than 200 GB of CPU RAM (according to the command that tracks maximum memory consumption). In that scenario I switched to ZeRO 2 instead of ZeRO 3.
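A minimal sketch of the ZeRO 2 switch mentioned above, assuming only the stage number changes relative to the ds_config.json in the report (the stage3_max_live_parameters key is ZeRO-3-specific and is dropped; other fields keep DeepSpeed's defaults):

```json
{
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 2
  }
}
```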

We can close the issue. Maybe tagging it with model parallel and deepspeed would improve its discoverability?

1 reaction
patrickvonplaten commented, Sep 19, 2021

Hey @apohllo,

I think the problem is that you don’t have enough CPU RAM to instantiate the model. mt5-xxl requires around 95 GB of RAM to be loaded into memory currently. We are working on an improved loading implementation for large files though that should reduce this number to something much closer to the model file size (48GB) - see: https://github.com/huggingface/transformers/issues/13548
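The ~95 GB figure is roughly consistent with a peak of about twice the checkpoint size — my assumption about the mechanism, not stated in the thread: the model is first materialised with random weights and the full deserialised state dict is then held in memory alongside it.

```python
# Back-of-the-envelope check of the RAM numbers quoted above.
# Assumption: peak CPU RAM ~= 2x checkpoint size (randomly initialised
# model plus the deserialised state dict resident at the same time).
CHECKPOINT_GB = 48          # mt5-xxl pytorch_model.bin size, per the comment
peak_gb = 2 * CHECKPOINT_GB
print(peak_gb)              # 96, close to the quoted ~95 GB
```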
