"wall_clock_breakdown": true and overlap_comm: true
Hi, I have a question regarding how DeepSpeed measures the communication time. I see that there is a timer that counts the time for allreduce, as in https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/engine.py#L865-L874. But when I go into this function, I find that it eventually reaches https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/zero/stage2.py#L468-L509, which only performs CUDA synchronization when overlap_comm=True. If I remember correctly, the PyTorch backward pass is a blocking operation, so the communication finishes before we even enter self.timers('backward_allreduce_microstep').start().
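The pitfall behind this question — a host-side timer around an asynchronous operation measures only the launch, not the work — can be reproduced with a stdlib-only sketch. No DeepSpeed or CUDA code is involved; the background thread below merely stands in for an allreduce running on a side stream, and `fake_allreduce` is a made-up name for illustration:

```python
import threading
import time

def fake_allreduce(duration):
    """Stand-in for an async allreduce running on a side stream."""
    time.sleep(duration)

# Timer WITHOUT synchronization: only the launch is measured.
t0 = time.perf_counter()
worker = threading.Thread(target=fake_allreduce, args=(0.2,))
worker.start()                      # returns immediately, like an async launch
launch_only = time.perf_counter() - t0

# Timer WITH synchronization: the real communication time is captured.
t0 = time.perf_counter()
worker.join()                       # analogous to a CUDA synchronization
with_sync = launch_only + (time.perf_counter() - t0)

print(f"launch only: {launch_only * 1e3:.1f} ms, with sync: {with_sync * 1e3:.1f} ms")
```

The "launch only" figure comes out orders of magnitude smaller than the 200 ms of actual work, which is why a timer that lacks a synchronization inside the timed region can report a misleadingly small communication time.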
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
Thanks. This explains it. I think it may be helpful to add some comments around that allreduce call; otherwise it is misleading. Also, since DeepSpeed uses its own launcher, I cannot directly use

nsys profile deepspeed <args>

for profiling, and mpirun has issues as well (see https://github.com/microsoft/DeepSpeed/issues/461). Is there a way I can use Nsight to profile the job? Recently, I came across a case where the backward pass is 26× more expensive than the forward for 10B-parameter model training. In general, the cost of the backward is only about 2× that of the forward (it can be more if the workers communicate inside the backward pass, but it still shouldn't be 26×). This is strange, as I am using 128 A100 GPUs with 4 AWS EFA NICs enabled (400 Gb/s bandwidth), so the network should not be the problem, and I need nsys to profile the training job to figure out why this happens.

@szhengac, could you please share how you got Nsight working with the DeepSpeed launcher? The command I am using is below, and the report doesn't contain any CUDA trace information:

nsys profile --trace=cuda deepspeed
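The report comes out empty because `nsys` only traces the launcher process, while the CUDA work happens in the per-rank child processes that the launcher spawns. One workaround (a sketch, not an official DeepSpeed feature) is to point `deepspeed` at a small wrapper script that re-execs the real training script under `nsys`, producing one report per rank. The wrapper itself and the `train.py` name are hypothetical; the `LOCAL_RANK` environment variable is set by the DeepSpeed launcher for each rank:

```python
#!/usr/bin/env python3
"""Hypothetical per-rank wrapper: launch as `deepspeed nsys_wrapper.py <train args>`.

Each rank re-execs `train.py` under nsys, yielding one report per rank.
"""
import os
import sys

def build_cmd(rank, train_args):
    # One nsys report per rank; --trace=cuda,nvtx captures kernels, memcpys,
    # and any NVTX ranges the training code emits.
    return ["nsys", "profile", "--trace=cuda,nvtx",
            "--output", f"report_rank{rank}",
            sys.executable, "train.py"] + list(train_args)

if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    rank = os.environ["LOCAL_RANK"]          # set by the DeepSpeed launcher
    cmd = build_cmd(rank, sys.argv[1:])
    os.execvp(cmd[0], cmd)                   # replace this process with nsys + trainer
```

The wrapper forwards all remaining arguments (including the `--local_rank` flag the launcher appends) straight to the training script, so the trainer sees exactly the command line it would have received without profiling.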