"wall_clock_breakdown": true and overlap_comm: true
Hi, I have a question regarding how DeepSpeed measures the communication time. I see that there is a timer that counts the time for allreduce, as in https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/engine.py#L865-L874. But when I go into this function, I find that it eventually reaches https://github.com/microsoft/DeepSpeed/blob/81aeea361da3936b875a678b9cb44596800510b5/deepspeed/runtime/zero/stage2.py#L468-L509, which only performs CUDA synchronization when overlap_comm=True. If I remember correctly, the PyTorch backward pass is a blocking operation, so the communication finishes before we even enter self.timers('backward_allreduce_microstep').start().
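The pitfall behind this question — a host-side timer around an asynchronous operation measures only the launch, not the work — can be reproduced with a stdlib-only sketch. No DeepSpeed or CUDA code is involved; the background thread below merely stands in for an allreduce running on a side stream, and `fake_allreduce` is a made-up name for illustration:

```python
import threading
import time

def fake_allreduce(duration):
    """Stand-in for an async allreduce running on a side stream."""
    time.sleep(duration)

# Timer WITHOUT synchronization: only the launch is measured.
t0 = time.perf_counter()
worker = threading.Thread(target=fake_allreduce, args=(0.2,))
worker.start()                      # returns immediately, like an async launch
launch_only = time.perf_counter() - t0

# Timer WITH synchronization: the real communication time is captured.
t0 = time.perf_counter()
worker.join()                       # analogous to a CUDA synchronization
with_sync = launch_only + (time.perf_counter() - t0)

print(f"launch only: {launch_only * 1e3:.1f} ms, with sync: {with_sync * 1e3:.1f} ms")
```

The "launch only" figure comes out orders of magnitude smaller than the 200 ms of actual work, which is why a timer that lacks a synchronization inside the timed region can report a misleadingly small communication time.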
Issue Analytics
- State:
- Created 3 years ago
- Comments: 7 (6 by maintainers)
Top GitHub Comments
Thanks. This explains it. I think it may be helpful to add some comments around that allreduce call; otherwise it is misleading. Also, since DeepSpeed uses its own launcher, I cannot directly use

nsys profile deepspeed <args>

for profiling, and mpirun has issues as well (see https://github.com/microsoft/DeepSpeed/issues/461). Is there a way I can use Nsight to profile the job? Recently, I came across a case where the backward pass is 26× more expensive than the forward for 10B-parameter model training. In general, the cost of the backward is only about 2× that of the forward (it can be more if the workers communicate inside the backward pass, but it still shouldn't be 26×). This is strange, as I am using 128 A100 GPUs with 4 AWS EFA NICs enabled (400 Gb/s bandwidth), so the network should not be the problem, and I need nsys to profile the training job to figure out why this happens.

@szhengac, could you please share how you got Nsight working with the DeepSpeed launcher? The command I am using is below, and the report doesn't contain any CUDA trace information:

nsys profile --trace=cuda deepspeed
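The report comes out empty because `nsys` only traces the launcher process, while the CUDA work happens in the per-rank child processes that the launcher spawns. One workaround (a sketch, not an official DeepSpeed feature) is to point `deepspeed` at a small wrapper script that re-execs the real training script under `nsys`, producing one report per rank. The wrapper itself and the `train.py` name are hypothetical; the `LOCAL_RANK` environment variable is set by the DeepSpeed launcher for each rank:

```python
#!/usr/bin/env python3
"""Hypothetical per-rank wrapper: launch as `deepspeed nsys_wrapper.py <train args>`.

Each rank re-execs `train.py` under nsys, yielding one report per rank.
"""
import os
import sys

def build_cmd(rank, train_args):
    # One nsys report per rank; --trace=cuda,nvtx captures kernels, memcpys,
    # and any NVTX ranges the training code emits.
    return ["nsys", "profile", "--trace=cuda,nvtx",
            "--output", f"report_rank{rank}",
            sys.executable, "train.py"] + list(train_args)

if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    rank = os.environ["LOCAL_RANK"]          # set by the DeepSpeed launcher
    cmd = build_cmd(rank, sys.argv[1:])
    os.execvp(cmd[0], cmd)                   # replace this process with nsys + trainer
```

The wrapper forwards all remaining arguments (including the `--local_rank` flag the launcher appends) straight to the training script, so the trainer sees exactly the command line it would have received without profiling.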