Performance Degradation with ZeRO Stage 3
Hi,
I am trying to benchmark a 10B-parameter Hugging Face RobertaForMaskedLM model with both ZeRO Stage 2 and ZeRO Stage 3 to compare the latency impact of parameter partitioning.
However, Stage 3 is performing much worse than I expected, so I want to check whether something looks wrong.
| Description         | Model size | # p3dn hosts | Batch size | Samples/sec | TFLOPS/GPU |
|---------------------|------------|--------------|------------|-------------|------------|
| Baseline (Stage 2)  | 10B        | 16           | 8          | 178         | 44.7       |
| Stage 3, no offload | 10B        | 16           | 8          | 77          | 19.5       |
| Stage 3, no offload | 10B        | 8            | 8          | 41          | 21.9       |
| Stage 3, no offload | 10B        | 4            | 8          | 23          | 23.5       |
| Stage 3, no offload | 10B        | 2            | 8          | 11.6        | 23.5       |
| Stage 3, no offload | 10B        | 1            | 8          | OOM         | OOM        |
The problem does not seem to be related to network bandwidth, because when I move to p4d machines, which have 4x the bandwidth of p3dn machines (400 Gbps vs 100 Gbps), I see similar degradation:
| Description         | Model size | # p4d hosts | Batch size | Samples/sec | TFLOPS/GPU |
|---------------------|------------|-------------|------------|-------------|------------|
| Baseline (Stage 2)  | 10B        | 16          | 8          | 432         | 109        |
| Stage 3, no offload | 10B        | 4           | 8          | 44          | 44.5       |
I tried increasing stage3_max_live_parameters from 1e9 → 2e9 and stage3_prefetch_bucket_size from 5e8 → 10e8, but neither change made a difference to performance.
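For reference, the Stage 3 block with those bumped values looked roughly like the Python dict below (only the two tuned keys differ from the Stage 3 config in the P.S.; everything else in my ds_config is unchanged):

```python
# Sketch of the tuned zero_optimization block; values are the ones quoted above.
zero_optimization = {
    "stage": 3,
    "overlap_comm": True,
    "stage3_max_live_parameters": 2e9,    # bumped from 1e9
    "stage3_prefetch_bucket_size": 10e8,  # bumped from 5e8
}
```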
In addition, I ended up adding some time.time() statements before and after the following calls (roughly as sketched after this list):
a. the blocking fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1520
b. the non-blocking pre-fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1525-L1528
c. the release() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1540
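The instrumentation was nothing more sophisticated than wall-clock timing around those three call sites; schematically it looked like the sketch below (the helper and accumulator names are placeholders for my ad-hoc edits to stage3.py, not actual DeepSpeed symbols):

```python
import time
from collections import defaultdict

# Placeholder accumulators for the ad-hoc timing added around the three
# call sites linked above (fetch / pre-fetch / release).
totals_ms = defaultdict(float)

def timed(name, fn, *args, **kwargs):
    """Run fn and add its wall-clock duration (in ms) to totals_ms[name]."""
    start = time.time()
    result = fn(*args, **kwargs)
    totals_ms[name] += (time.time() - start) * 1000.0
    return result

# At each call site (illustrative, not the real symbol names):
#   timed("fetch", param_coordinator.fetch_sub_module, sub_module)
# and at the end of the step:
#   print(f"Total fetch time = {totals_ms['fetch']} ms; ...")
```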
Counter-intuitively, the majority of the time was spent in what is supposed to be the non-blocking pre-fetch call:
Total fetch time = 599.5581150054932 ms;
Total pre-fetch time = 4473.618030548096 ms;
Total release time = 1130.7482719421387 ms
Total time = 6203.9244174957275 ms
In fact, after a bit more digging and some additional timing statements, I isolated the source of the slow pre-fetch to this line: https://github.com/microsoft/DeepSpeed/blob/18a26e8604c4cb8562ed8d57241ca64dbeb4318a/deepspeed/runtime/zero/partition_parameters.py#L798
Any ideas why I am seeing a 2x-or-greater drop in performance when moving from Stage 2 to Stage 3? And why does pre-fetching take so much time when it is supposed to be an asynchronous background operation?
Thanks, Stephen
P.S. Details are below.
Model config:
RobertaForMaskedLM:
  max_position_embeddings: 512
  type_vocab_size: 1
  num_attention_heads: 40
  num_hidden_layers: 30
  hidden_size: 5120
  intermediate_size: 20480
  gradient_checkpointing: true
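(For context, this maps onto the Hugging Face API roughly as sketched below; any RobertaConfig field not listed above, such as vocab_size, is left at its default here, which is an assumption on my part rather than something taken from the training script.)

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Rough sketch of building the ~10B model from the config above.
config = RobertaConfig(
    max_position_embeddings=512,
    type_vocab_size=1,
    num_attention_heads=40,
    num_hidden_layers=30,
    hidden_size=5120,
    intermediate_size=20480,
)
model = RobertaForMaskedLM(config)
model.gradient_checkpointing_enable()  # corresponds to gradient_checkpointing: true
```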
ZeRO Stage 2 config:
zero_optimization:
  stage: 2
  overlap_comm: true
ZeRO Stage 3 config:
zero_optimization:
  stage: 3
  overlap_comm: true
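(For completeness, a minimal sketch of how a zero_optimization block like the ones above gets consumed; the batch size comes from the tables earlier, and the optimizer settings are placeholders since the real training script is not shown here.)

```python
import deepspeed

# Minimal, illustrative ds_config; only the zero_optimization block matters
# for this issue. Batch size 8 matches the tables above; the optimizer block
# is a placeholder, not the actual settings used in my runs.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
    },
}

# `model` is the RobertaForMaskedLM sketched earlier.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```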
Top GitHub Comments
Hey @tjruwase, we have published a PR with optimizations and some other improvements here: https://github.com/microsoft/DeepSpeed/pull/1453
Quick note: c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::allgather_coalesced is stubbed out in PyTorch (https://github.com/pytorch/pytorch/blob/44922f26f5b1d7b6ee8d217a9d32bc0e40ec47a6/torch/lib/c10d/ProcessGroupNCCL.cpp#L1312-L1318). @zarzen and I both have DeepSpeed/Python implementations of this, but it could benefit everyone to move the implementation to PyTorch/C++ at some point.