ZeRO-3 hangs in inference
Training works with ZeRO-3, but when I then do inference by calling the DeepSpeed engine's forward(), it works on a very small sample, yet with a just slightly bigger sample it hangs at 100% GPU utilization:
Thread 0x00007f57caf71740 (most recent call first):
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/cuda/streams.py", line 95 in synchronize
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 490 in _synchronize_communication
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 406 in fetch_sub_module
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1139 in pre_sub_module_forward_function
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1071 in _pre_forward_module_hook
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 451 in project
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 474 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 540 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 633 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 954 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 1505 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 893 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 872 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer_seq2seq.py", line 185 in prediction_step
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer.py", line 1800 in prediction_loop
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer.py", line 1647 in evaluate
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer_seq2seq.py", line 74 in evaluate
File "examples/seq2seq/run_seq2seq.py", line 607 in main
File "examples/seq2seq/run_seq2seq.py", line 655 in <module>
The trace is from faulthandler, so please read it in reverse.
I'm not sure whether you have inference tests - maybe this can be reproduced with just model.eval()?
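For reference, a rough sketch of the kind of eval-only call path involved (illustrative script only, not the actual run_seq2seq.py invocation; the model size and config file name are placeholders, and it assumes a deepspeed.initialize() version that accepts config= as a path or dict):

import deepspeed
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical repro skeleton: a ZeRO-3 engine used purely for forward passes in eval mode.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
engine, _, _, _ = deepspeed.initialize(model=model, config="ds_config_zero3.json")  # placeholder config path

engine.eval()
batch = tokenizer(["translate English to German: hello world"], return_tensors="pt")
batch = {k: v.to(engine.device) for k, v in batch.items()}
batch["labels"] = batch["input_ids"].clone()

with torch.no_grad():
    out = engine(**batch)  # with a larger eval set, this is roughly where the hang shows up

# launch with the distributed launcher, e.g.: deepspeed --num_gpus=2 this_script.py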
Config:
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "cpu_offload_use_pin_memory": true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e8,
        "stage3_prefetch_bucket_size": 2e5,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 3e6,
        "prefetch_bucket_size": 3e6,
        "sub_group_size": 1e6
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
Thanks.
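For context, a JSON config like the one above is typically handed to DeepSpeed by the Hugging Face Trainer (which the trace above goes through) when its deepspeed argument is set. A minimal sketch, with placeholder names:

from transformers import TrainingArguments

# Sketch only: output_dir and the config file name are placeholders.
training_args = TrainingArguments(
    output_dir="output",
    fp16=True,                         # should agree with "fp16.enabled" in the JSON
    deepspeed="ds_config_zero3.json",  # path to a config like the one shown above
)
# A Trainer built with these args calls deepspeed.initialize() internally for train/eval.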
Top GitHub Comments
Thank you @stas00 for digging into this. I am glad you were able to get to the core of the problem.
This makes sense, and it is pretty much what I was expecting as well. Since ZeRO-3 is a single program multiple data (SPMD) approach to parallelism with coordinated data movement, all processes must be running the same program - in this case the forward on the model on each process - for it to work correctly.
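A minimal, self-contained illustration of this failure class (an editorial sketch, not DeepSpeed code): a collective only completes when every rank participates, so rank-dependent control flow around forward() under ZeRO-3 leaves the other ranks stuck inside their parameter gather.

import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")      # gloo so this runs on CPU-only machines
    rank = dist.get_rank()
    t = torch.ones(1) * rank

    # Uneven number of "forward passes" per rank -- the problematic pattern.
    steps = 3 if rank == 0 else 2
    for i in range(steps):
        dist.all_reduce(t)               # stands in for ZeRO-3's parameter all-gather
        print(f"rank {rank} finished collective {i}")

    if rank != 0:
        time.sleep(600)                  # keep the process alive; rank 0 now hangs
                                         # in its 3rd all_reduce with no partner

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=2 this_script.py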
I agree that the hack is limiting, but I have a slightly different view on the “designed to work with any model” part. It seems that the code is actually designed to work only with single-GPU models, and is limited in that sense. As long as the model is single-GPU it will work, but it will not work with any multi-GPU model, regardless of whether it uses ZeRO-3, model parallelism (tensor slicing), or pipeline parallelism, since each of these requires some form of special treatment that is inherent in the parallelism itself. For example, model parallelism requires the data loader to give the same sample to all GPUs, and pipeline parallelism requires the data loader to feed samples only to the first-stage GPU.
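To make that distinction concrete, a small illustrative sketch (the function and its arguments are invented for illustration, not a DeepSpeed or Transformers API) of how data feeding differs per parallelism type:

from torch.utils.data import DataLoader, DistributedSampler

def build_eval_loader(dataset, parallelism, rank, world_size, batch_size=8):
    # Illustrative only: real integrations hide this inside the trainer/engine.
    if parallelism == "zero3":                # data parallel: each rank gets its own shard
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
        return DataLoader(dataset, sampler=sampler, batch_size=batch_size)
    if parallelism == "tensor_parallel":      # model parallel: every rank sees the same batch
        return DataLoader(dataset, shuffle=False, batch_size=batch_size)
    if parallelism == "pipeline_parallel":    # only the first stage consumes input data
        return DataLoader(dataset, batch_size=batch_size) if rank == 0 else None
    raise ValueError(f"unknown parallelism: {parallelism}")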
A potential solution here could be to extend the code to support multi-GPU inference by allowing for variations based on the type of parallelism being used.
I think this can be mitigated to the point that the waste in resources is minimal. Two potential solutions:
Thank you @stas00! I have opened a new issue and tagged you here: https://github.com/huggingface/transformers/issues/16688