Loading mt5-xxl raises error related to PyTorch/TF incompatibility
See original GitHub issue

Environment info
- transformers version: 4.10.0
- Platform: Linux-3.10.0-1160.36.2.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.9.1
- PyTorch version (GPU?): 1.9.0+cu102 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: yes
Who can help
Model hub: @patrickvonplaten
Information
Model I am using: mt5-xxl
The problem arises when using:
- the official example scripts: examples/pytorch/translation/run_translation.py
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: the example translation task given in the documentation
- my own task or dataset:
To reproduce
Steps to reproduce the behavior.
I am using deepspeed to load the model into memory.
deepspeed transformers/examples/pytorch/translation/run_translation.py \
--do_train --model_name_or_path google/mt5-xxl \
--source_lang en --target_lang ro \
--dataset_name wmt16 --dataset_config_name ro-en \
--output_dir exp/33-a --per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 --overwrite_output_dir --deepspeed ds_config.json
ds_config.json
{
"train_batch_size": "auto",
"zero_optimization": {
"stage": 3,
"stage3_max_live_parameters": 1e9
}
}
Expected behavior
The model should be trained, but I get the following error:
OSError: Unable to load weights from pytorch checkpoint file for 'google/mt5-xxl' at '/net/people/plgapohl/.cache/huggingface/transformers/b36655ddd18a5fda6384477b693eb51ddc8d5bfd2e9a91ed202317f660041716.c24311abc84f0f3a6095195722be4735840971f245dfb6ea3a407c9bed537390'If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
~Changing line 342 in run_translation.py to pass from_tf=True fixes the problem.~
~I believe the model on the hub named pytorch_model.bin is in fact a TF model.~
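For reference, the struck-out workaround above boils down to forcing the TF loading path when instantiating the model. A minimal sketch (simplified relative to the actual run_translation.py, and note that from_tf=True additionally requires TensorFlow, which is not installed in the environment above):

from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Simplified version of the model-loading step in run_translation.py.
# from_tf=True tells transformers to read the checkpoint as TF 2.0
# weights instead of a PyTorch state dict - this is the workaround
# suggested by the error message, not the actual fix.
config = AutoConfig.from_pretrained("google/mt5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/mt5-xxl",
    config=config,
    from_tf=True,
)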
Issue Analytics
- Created 2 years ago
- Comments: 6 (5 by maintainers)
Yes, I can confirm this is a CPU RAM related issue. I managed to load t5-xxl on a single V100 GPU (32GB VRAM) with 70GB of CPU RAM, and the same model on two V100 GPUs (32GB VRAM each) with more than 200GB of CPU RAM (according to the command which tracks maximum memory consumption). In that scenario I switched to ZeRO 2 instead of ZeRO 3.

We can close the issue. Maybe tagging it with model parallel and deepspeed would improve the issue's discoverability?
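For context, the ZeRO 2 switch mentioned above amounts to changing the stage in ds_config.json; roughly the following, as a sketch rather than the exact config used here:

{
  "train_batch_size": "auto",
  "zero_optimization": {
    "stage": 2
  }
}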
Hey @apohllo,
I think the problem is that you don't have enough CPU RAM to instantiate the model. mt5-xxl requires around 95 GB of RAM to be loaded into memory currently. We are, however, working on an improved loading implementation for large files that should reduce this number to something much closer to the model file size (48GB) - see: https://github.com/huggingface/transformers/issues/13548
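For completeness, recent transformers releases expose a low_cpu_mem_usage flag on from_pretrained that targets this same peak-RAM problem. A sketch, assuming a version that ships the option (exact availability and behaviour depend on the installed release):

from transformers import AutoModelForSeq2SeqLM

# low_cpu_mem_usage avoids holding both the randomly initialised model
# and the full checkpoint state dict in CPU RAM at the same time,
# bringing peak usage closer to the size of the weights file itself.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/mt5-xxl",
    low_cpu_mem_usage=True,
)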