How do I log the result of `wall_clock_breakdown` from all ranks?
Hi!
I’m trying out the pipeline parallelism example in microsoft/DeepSpeedExamples, on DeepSpeed commit ebed51df787579b8b3f836bb155ce0bf97f4ab66. I turned on `wall_clock_breakdown` in ds_config.json and have four pipeline stages across four A40 GPUs. Training runs well.
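For reference, `wall_clock_breakdown` is a top-level key in ds_config.json; this is the fragment I enabled (all other entries of my config are omitted here):

```json
{
  "wall_clock_breakdown": true
}
```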
However, I was wondering if there’s an option to have all four pipeline stages report their time breakdowns. Below is an excerpt of the output I’m currently getting; it looks like only rank 0 is reporting the computation and communication time for its own stage (i.e., the first stage of the pipeline):
```
steps: 20 loss: 4.6178 iter time (s): 0.994 samples/sec: 257.601
[2022-08-25 02:58:04,341] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 41.71 | pipe_recv_grad: 1725.64
[2022-08-25 02:58:13,587] [INFO] [logging.py:68:log_dist] [Rank 0] step=30, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2022-08-25 02:58:14,298] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 209.59 | forward_microstep: 2785.85 | backward_microstep: 5153.22 | backward_inner_microstep: 5153.13 | backward_allreduce_microstep: 0.00 | step_microstep: 12.50
[2022-08-25 02:58:14,299] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 2785.85 | backward: 5153.18 | backward_inner: 5153.08 | backward_allreduce: 0.00 | step: 12.50
steps: 30 loss: 5.8922 iter time (s): 0.995 samples/sec: 257.271
[2022-08-25 02:58:14,300] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 41.68 | pipe_recv_grad: 1725.21
[2022-08-25 02:58:23,540] [INFO] [logging.py:68:log_dist] [Rank 0] step=40, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2022-08-25 02:58:24,251] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 204.02 | forward_microstep: 2786.27 | backward_microstep: 5153.85 | backward_inner_microstep: 5153.75 | backward_allreduce_microstep: 0.00 | step_microstep: 12.51
[2022-08-25 02:58:24,252] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 2786.28 | backward: 5153.80 | backward_inner: 5153.71 | backward_allreduce: 0.00 | step: 12.51
```
Thanks a lot.
Issue Analytics
- Created: a year ago
- Comments: 5
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@jaywonchung, agreed on the difficulty and the thoughts as well. We would certainly need a way to have users specify which ranks they wanted logging from to reduce the total amount of data logged, and definitely off/rank 0 only by default. I’ll chat with more folks here and add what thoughts we have and what work we will try to do. Thanks for your comments!
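A minimal sketch of what such a user-facing option could look like. Note that the `wall_clock_breakdown_ranks` config key and the `should_log_breakdown` helper are hypothetical, not an existing DeepSpeed feature; the sketch only assumes the launcher sets the standard `RANK` environment variable:

```python
import os


def should_log_breakdown(config: dict) -> bool:
    """Decide whether this process should print its wall-clock breakdown.

    ``wall_clock_breakdown_ranks`` is a hypothetical config key: a list of
    global ranks allowed to log, defaulting to rank 0 only (matching the
    current behavior).
    """
    # RANK is set by torch/DeepSpeed launchers; fall back to 0 for single-process runs.
    rank = int(os.environ.get("RANK", "0"))
    allowed = config.get("wall_clock_breakdown_ranks", [0])
    return rank in allowed
```

Each rank would evaluate this once and skip the timer dump entirely when it returns `False`, keeping the default output identical to today's rank-0-only logs.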
Thanks for looking into this.
I think it’s not trivial to do this nicely, because the semantics of distributed training matter. In pure data-parallel training (which is what DeepSpeed and ZeRO were originally built for), the wall clock breakdowns across ranks are going to be similar by design, so printing information from all ranks would just confuse people. I suppose this is why all-rank wall clock breakdown is currently not supported. In contrast, in pipeline parallel training each rank is doing different computation, and measurements from different ranks may genuinely differ. Plus, data parallelism and pipeline parallelism can be combined, so even when pipeline parallelism is active it doesn’t make sense to log from all ranks. Only the ranks that hold DP rank zero (i.e., one rank per pipeline stage) should print, for example.
Also, some thoughts in bullets:
- Use `deepspeed.runtime.pipe.topology.ProcessTopology` to pick out the global ranks that have DP rank 0 and only print timing info on those ranks.
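To illustrate the rank-selection idea without depending on DeepSpeed internals, here is a standalone sketch of the computation. It assumes a pipe-major rank layout (data axis fastest-varying, i.e. `rank = pipe * num_dp + data`); whether that matches a given `ProcessTopology` instance depends on its axis order, so treat the ordering as an assumption:

```python
def dp_zero_ranks(num_pp: int, num_dp: int) -> list:
    """Global ranks whose data-parallel coordinate is 0.

    Assumes pipe-major layout: rank = pipe * num_dp + data. This mirrors
    what filtering a ProcessTopology for data=0 would return under that
    layout: exactly one rank per pipeline stage.
    """
    return [pipe * num_dp for pipe in range(num_pp)]
```

For four pipeline stages and two data-parallel replicas this yields `[0, 2, 4, 6]`; each of those ranks would then print its own timer breakdown while the rest stay silent.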