ZeRO-3 hangs in inference
Training works with ZeRO-3, but when I then do inference by calling the DeepSpeed engine's forward(), it works on a very small sample, yet with a just slightly bigger sample it hangs at 100% GPU utilization:
Thread 0x00007f57caf71740 (most recent call first):
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/cuda/streams.py", line 95 in synchronize
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 490 in _synchronize_communication
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 406 in fetch_sub_module
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1139 in pre_sub_module_forward_function
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/zero/stage3.py", line 1071 in _pre_forward_module_hook
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 881 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 451 in project
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 474 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 540 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 633 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 954 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/models/t5/modeling_t5.py", line 1505 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 892 in _call_impl
File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/runtime/engine.py", line 893 in forward
File "/home/stas/anaconda3/envs/main-38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 872 in _call_impl
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer_seq2seq.py", line 185 in prediction_step
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer.py", line 1800 in prediction_loop
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer.py", line 1647 in evaluate
File "/mnt/nvme1/code/huggingface/transformers-ds-zero-3/src/transformers/trainer_seq2seq.py", line 74 in evaluate
File "examples/seq2seq/run_seq2seq.py", line 607 in main
File "examples/seq2seq/run_seq2seq.py", line 655 in <module>
The trace is from faulthandler, so please read it in reverse.
I'm not sure whether you have inference tests - maybe this can be reproduced with just model.eval()?
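For reference, a rough sketch of the kind of eval-only call path involved (illustrative script only, not the actual run_seq2seq.py invocation; the model size and config file name are placeholders, and it assumes a deepspeed.initialize() version that accepts config= as a path or dict):

import deepspeed
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical repro skeleton: a ZeRO-3 engine used purely for forward passes in eval mode.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
engine, _, _, _ = deepspeed.initialize(model=model, config="ds_config_zero3.json")  # placeholder config path

engine.eval()
batch = tokenizer(["translate English to German: hello world"], return_tensors="pt")
batch = {k: v.to(engine.device) for k, v in batch.items()}
batch["labels"] = batch["input_ids"].clone()

with torch.no_grad():
    out = engine(**batch)  # with a larger eval set, this is roughly where the hang shows up

# launch with the distributed launcher, e.g.: deepspeed --num_gpus=2 this_script.py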
Config:
{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "zero_optimization": {
        "stage": 3,
        "cpu_offload": true,
        "cpu_offload_params": true,
        "cpu_offload_use_pin_memory": true,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e8,
        "stage3_prefetch_bucket_size": 2e5,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 3e6,
        "prefetch_bucket_size": 3e6,
        "sub_group_size": 1e6
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 3e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 3e-7
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 3e-5,
            "warmup_num_steps": 500
        }
    },
    "steps_per_print": 2000,
    "wall_clock_breakdown": false
}
Thanks.
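For context, a JSON config like the one above is typically handed to DeepSpeed by the Hugging Face Trainer (which the trace above goes through) when its deepspeed argument is set. A minimal sketch, with placeholder names:

from transformers import TrainingArguments

# Sketch only: output_dir and the config file name are placeholders.
training_args = TrainingArguments(
    output_dir="output",
    fp16=True,                         # should agree with "fp16.enabled" in the JSON
    deepspeed="ds_config_zero3.json",  # path to a config like the one shown above
)
# A Trainer built with these args calls deepspeed.initialize() internally for train/eval.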
Top GitHub Comments
Thank you @stas00 for digging into this. I am glad you were able to get to the core of the problem.
This makes sense, and it is pretty much what I was expecting as well. Since ZeRO-3 is a single program multiple data (SPMD) approach to parallelism with coordinated data movement, all processes must be running the same program - in this case the forward on the model on each process - for it to work correctly.
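A minimal, self-contained illustration of this failure class (an editorial sketch, not DeepSpeed code): a collective only completes when every rank participates, so rank-dependent control flow around forward() under ZeRO-3 leaves the other ranks stuck inside their parameter gather.

import time
import torch
import torch.distributed as dist

def main():
    dist.init_process_group("gloo")      # gloo so this runs on CPU-only machines
    rank = dist.get_rank()
    t = torch.ones(1) * rank

    # Uneven number of "forward passes" per rank -- the problematic pattern.
    steps = 3 if rank == 0 else 2
    for i in range(steps):
        dist.all_reduce(t)               # stands in for ZeRO-3's parameter all-gather
        print(f"rank {rank} finished collective {i}")

    if rank != 0:
        time.sleep(600)                  # keep the process alive; rank 0 now hangs
                                         # in its 3rd all_reduce with no partner

if __name__ == "__main__":
    main()   # launch with: torchrun --nproc_per_node=2 this_script.py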
I agree that the hack is limiting, but I have a slightly different view on the “designed to work with any model” part. It seems that the code is actually designed to work only with single-GPU models, and is limited in that sense. As long as the model is single-GPU it will work, but it will not work with any multi-GPU model, regardless of whether it uses ZeRO-3, model parallelism (tensor slicing), or pipeline parallelism, since each of these requires some form of special treatment that is inherent in the parallelism itself. For example, model parallelism requires the data loader to give the same sample to all GPUs, and pipeline parallelism requires the data loader to feed samples only to the first-stage GPU.
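To make that distinction concrete, a small illustrative sketch (the function and its arguments are invented for illustration, not a DeepSpeed or Transformers API) of how data feeding differs per parallelism type:

from torch.utils.data import DataLoader, DistributedSampler

def build_eval_loader(dataset, parallelism, rank, world_size, batch_size=8):
    # Illustrative only: real integrations hide this inside the trainer/engine.
    if parallelism == "zero3":                # data parallel: each rank gets its own shard
        sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False)
        return DataLoader(dataset, sampler=sampler, batch_size=batch_size)
    if parallelism == "tensor_parallel":      # model parallel: every rank sees the same batch
        return DataLoader(dataset, shuffle=False, batch_size=batch_size)
    if parallelism == "pipeline_parallel":    # only the first stage consumes input data
        return DataLoader(dataset, batch_size=batch_size) if rank == 0 else None
    raise ValueError(f"unknown parallelism: {parallelism}")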
A potential solution here could be to extend the code to support multi-GPU inference by allowing for variations based on the type of parallelism being used.
I think this can be mitigated to the point that the waste in resources is minimal. Two potential solutions:
Thank you @stas00! I have opened a new issue and tagged you here: https://github.com/huggingface/transformers/issues/16688