
How do I log the result of `wall_clock_breakdown` from all ranks?

See original GitHub issue

Hi!

I’m trying out the pipeline parallelism example in microsoft/DeepSpeedExamples, running on DeepSpeed commit ebed51df787579b8b3f836bb155ce0bf97f4ab66.

I turned on wall_clock_breakdown in ds_config.json and have four pipeline stages across four A40 GPUs. Training runs well.
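
For reference, here is roughly the relevant slice of my ds_config.json, shown as the equivalent Python dict (the batch size is illustrative; wall_clock_breakdown is the only setting in question, and passing a dict via deepspeed.initialize(config=...) should behave the same as pointing at the JSON file):

```python
# Minimal sketch of the config that enables the timing report below.
# Only "wall_clock_breakdown" matters here; the other value is illustrative.
ds_config = {
    "train_batch_size": 256,       # illustrative; use your run's actual value
    "wall_clock_breakdown": True,  # emits the per-step timing breakdown
}
```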

However, I was wondering if there’s an option to have all four pipeline stages report their time breakdowns. Below is an excerpt of the output I’m currently getting; it looks like only rank 0 reports the computation and communication times for its own stage (i.e., the first stage of the pipeline):

steps: 20 loss: 4.6178 iter time (s): 0.994 samples/sec: 257.601
[2022-08-25 02:58:04,341] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 41.71 | pipe_recv_grad: 1725.64
[2022-08-25 02:58:13,587] [INFO] [logging.py:68:log_dist] [Rank 0] step=30, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2022-08-25 02:58:14,298] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 209.59 | forward_microstep: 2785.85 | backward_microstep: 5153.22 | backward_inner_microstep: 5153.13 | backward_allreduce_microstep: 0.00 | step_microstep: 12.50
[2022-08-25 02:58:14,299] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 2785.85 | backward: 5153.18 | backward_inner: 5153.08 | backward_allreduce: 0.00 | step: 12.50
steps: 30 loss: 5.8922 iter time (s): 0.995 samples/sec: 257.271
[2022-08-25 02:58:14,300] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | pipe_send_output: 41.68 | pipe_recv_grad: 1725.21
[2022-08-25 02:58:23,540] [INFO] [logging.py:68:log_dist] [Rank 0] step=40, skipped=0, lr=[0.001], mom=[[0.9, 0.999]]
[2022-08-25 02:58:24,251] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | batch_input: 204.02 | forward_microstep: 2786.27 | backward_microstep: 5153.85 | backward_inner_microstep: 5153.75 | backward_allreduce_microstep: 0.00 | step_microstep: 12.51
[2022-08-25 02:58:24,252] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 2786.28 | backward: 5153.80 | backward_inner: 5153.71 | backward_allreduce: 0.00 | step: 12.51

Thanks a lot.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5

Top GitHub Comments

1 reaction
loadams commented, Nov 18, 2022

@jaywonchung, agreed on the difficulty, and on your thoughts as well. We would certainly need a way for users to specify which ranks they want logging from, to reduce the total amount of data logged, and it should definitely be off (or rank 0 only) by default. I’ll chat with more folks here and follow up with our thoughts and the work we plan to do. Thanks for your comments!
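
To make that concrete, a purely hypothetical sketch of the config shape under discussion (the wall_clock_breakdown_ranks key does not exist in DeepSpeed today; only wall_clock_breakdown is real):

```python
# Hypothetical sketch only: "wall_clock_breakdown_ranks" is NOT a real
# DeepSpeed config key; it just illustrates the rank-selection idea above.
ds_config = {
    "wall_clock_breakdown": True,
    # Hypothetical: global ranks allowed to emit breakdowns.
    # Defaulting to [0] would preserve today's rank-0-only behavior.
    "wall_clock_breakdown_ranks": [0, 1, 2, 3],
}
```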

1 reaction
jaywonchung commented, Nov 18, 2022

Thanks for looking into this.

I think it’s not trivial to do this nicely, because the semantics of distributed training matter. When doing pure data-parallel training (which is what DeepSpeed and ZeRO were originally built for), the wall clock breakdowns across every rank are similar by design, and printing information from all ranks would just confuse people; I suppose this is why an all-rank wall clock breakdown is currently not supported. In contrast, during pipeline-parallel training each rank performs different computation, so measurements from different ranks genuinely differ. Moreover, data parallelism and pipeline parallelism can be combined, so even when pipeline parallelism is active it doesn’t make sense to log from all ranks: only the ranks with DP rank zero (i.e., one rank per pipeline stage) should print, for example.

Also some thoughts in bullets:

  • All-rank wall clock breakdown should be an off-by-default feature since it’s going to lead to an explosion of logging outputs.
  • How can we aggregate all wall clock breakdown information across multiple nodes? Do we want to even support that?
  • Maybe you can utilize deepspeed.runtime.pipe.topology.ProcessTopology to pick out the global ranks that have DP rank 0 and only print info on those ranks; a sketch follows below.
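
For example, a minimal sketch along those lines, assuming a 4-stage pipeline with 2-way data parallelism (PipeDataParallelTopology is the concrete ProcessTopology that PipelineModule builds by default):

```python
import torch.distributed as dist
from deepspeed.runtime.pipe.topology import PipeDataParallelTopology

# Assumed layout: 4 pipeline stages x 2 data-parallel replicas = 8 ranks.
topo = PipeDataParallelTopology(num_pp=4, num_dp=2)

# filter_match() returns the global ranks whose coordinates match the given
# axis values, i.e. here the DP-rank-0 replica of each pipeline stage.
logging_ranks = set(topo.filter_match(data=0))

rank = dist.get_rank()
if rank in logging_ranks:
    stage = topo.get_coord(rank).pipe  # this rank's pipeline stage index
    # This is where the engine would print (or aggregate) its breakdown.
    print(f"[rank {rank} / pipeline stage {stage}] would log its wall clock breakdown")
```

filter_match() and get_coord() are existing ProcessTopology methods; the logging call itself is the part DeepSpeed would still need to wire into its timers.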