Performance Degradation with ZeRO Stage 3
Hi,
I am trying to benchmark a 10B-parameter Hugging Face RobertaForMaskedLM model with both ZeRO Stage 2 and ZeRO Stage 3 to compare the latency impact of parameter partitioning.
However, Stage 3 is performing much worse than I expected, so I want to check whether something looks wrong.
| Description         | Model size | # p3dn hosts | Batch size | Samples/sec | TFLOPS/GPU |
|---------------------|------------|--------------|------------|-------------|------------|
| Baseline (Stage 2)  | 10B        | 16           | 8          | 178         | 44.7       |
| Stage 3, no offload | 10B        | 16           | 8          | 77          | 19.5       |
| Stage 3, no offload | 10B        | 8            | 8          | 41          | 21.9       |
| Stage 3, no offload | 10B        | 4            | 8          | 23          | 23.5       |
| Stage 3, no offload | 10B        | 2            | 8          | 11.6        | 23.5       |
| Stage 3, no offload | 10B        | 1            | 8          | OOM         | OOM        |
The problem does not seem to be related to network bandwidth, because when I move to p4d machines, which have 4x the bandwidth of p3dn machines (400 Gbps vs 100 Gbps), I see similar degradation:
| Description         | Model size | # p4d hosts | Batch size | Samples/sec | TFLOPS/GPU |
|---------------------|------------|-------------|------------|-------------|------------|
| Baseline (Stage 2)  | 10B        | 16          | 8          | 432         | 109        |
| Stage 3, no offload | 10B        | 4           | 8          | 44          | 44.5       |
I tried increasing stage3_max_live_parameters from 1e9 → 2e9 and stage3_prefetch_bucket_size from 5e8 → 10e8, but neither change made a difference to performance.
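For reference, the Stage 3 block with those bumped values looked roughly like the Python dict below (only the two tuned keys differ from the Stage 3 config in the P.S.; everything else in my ds_config is unchanged):

```python
# Sketch of the tuned zero_optimization block; values are the ones quoted above.
zero_optimization = {
    "stage": 3,
    "overlap_comm": True,
    "stage3_max_live_parameters": 2e9,    # bumped from 1e9
    "stage3_prefetch_bucket_size": 10e8,  # bumped from 5e8
}
```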
In addition, I ended up adding some time.time() statements before and after the following calls (roughly as sketched after this list):
a. the blocking fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1520
b. the non-blocking pre-fetch() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1525-L1528
c. the release() call: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/zero/stage3.py#L1540
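The instrumentation was nothing more sophisticated than wall-clock timing around those three call sites; schematically it looked like the sketch below (the helper and accumulator names are placeholders for my ad-hoc edits to stage3.py, not actual DeepSpeed symbols):

```python
import time
from collections import defaultdict

# Placeholder accumulators for the ad-hoc timing added around the three
# call sites linked above (fetch / pre-fetch / release).
totals_ms = defaultdict(float)

def timed(name, fn, *args, **kwargs):
    """Run fn and add its wall-clock duration (in ms) to totals_ms[name]."""
    start = time.time()
    result = fn(*args, **kwargs)
    totals_ms[name] += (time.time() - start) * 1000.0
    return result

# At each call site (illustrative, not the real symbol names):
#   timed("fetch", param_coordinator.fetch_sub_module, sub_module)
# and at the end of the step:
#   print(f"Total fetch time = {totals_ms['fetch']} ms; ...")
```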
Counter-intuitively, the majority of the time was spent in what is supposed to be the non-blocking pre-fetch call:
Total fetch time = 599.5581150054932 ms;
Total pre-fetch time = 4473.618030548096 ms;
Total release time = 1130.7482719421387 ms
Total time = 6203.9244174957275 ms
In fact, after a bit more digging and some additional timing statements, I isolated the source of the slow pre-fetch to this line: https://github.com/microsoft/DeepSpeed/blob/18a26e8604c4cb8562ed8d57241ca64dbeb4318a/deepspeed/runtime/zero/partition_parameters.py#L798
Any ideas why I am seeing a 2x-or-greater drop in performance when moving from Stage 2 to Stage 3? And why does pre-fetching take so much time when it is supposed to be an asynchronous background operation?
Thanks, Stephen
P.S. Details are below.
Model config:
RobertaForMaskedLM:
  max_position_embeddings: 512
  type_vocab_size: 1
  num_attention_heads: 40
  num_hidden_layers: 30
  hidden_size: 5120
  intermediate_size: 20480
  gradient_checkpointing: true
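(For context, this maps onto the Hugging Face API roughly as sketched below; any RobertaConfig field not listed above, such as vocab_size, is left at its default here, which is an assumption on my part rather than something taken from the training script.)

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Rough sketch of building the ~10B model from the config above.
config = RobertaConfig(
    max_position_embeddings=512,
    type_vocab_size=1,
    num_attention_heads=40,
    num_hidden_layers=30,
    hidden_size=5120,
    intermediate_size=20480,
)
model = RobertaForMaskedLM(config)
model.gradient_checkpointing_enable()  # corresponds to gradient_checkpointing: true
```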
ZeRO Stage 2 config:
zero_optimization:
  stage: 2
  overlap_comm: true
ZeRO Stage 3 config:
zero_optimization:
  stage: 3
  overlap_comm: true
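(For completeness, a minimal sketch of how a zero_optimization block like the ones above gets consumed; the batch size comes from the tables earlier, and the optimizer settings are placeholders since the real training script is not shown here.)

```python
import deepspeed

# Minimal, illustrative ds_config; only the zero_optimization block matters
# for this issue. Batch size 8 matches the tables above; the optimizer block
# is a placeholder, not the actual settings used in my runs.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
    },
}

# `model` is the RobertaForMaskedLM sketched earlier.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```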
Top GitHub Comments
Hey @tjruwase, we have published a PR with optimizations and some other improvements here: https://github.com/microsoft/DeepSpeed/pull/1453
Quick note: c10::intrusive_ptr<ProcessGroup::Work> ProcessGroupNCCL::allgather_coalesced is stubbed out in PyTorch (https://github.com/pytorch/pytorch/blob/44922f26f5b1d7b6ee8d217a9d32bc0e40ec47a6/torch/lib/c10d/ProcessGroupNCCL.cpp#L1312-L1318). @zarzen and I both have DeepSpeed/Python implementations of this, but it could benefit everyone to move the implementation to PyTorch/C++ at some point.