[CI][release] long_running_horovod_tune_test failed
What happened + What you expected to happen
----------------------------------+
2022-10-12 18:03:13,558 ERROR tune.py:773 -- Trials did not complete: [HorovodTrainer_566a7_00000, HorovodTrainer_566a7_00001, HorovodTrainer_566a7_00002, HorovodTrainer_566a7_00003, HorovodTrainer_566a7_00004, HorovodTrainer_566a7_00005, HorovodTrainer_566a7_00006, HorovodTrainer_566a7_00007]
2022-10-12 18:03:13,558 INFO tune.py:778 -- Total run time: 19102.05 seconds (19101.48 seconds for the tuning loop).
Traceback (most recent call last):
  File "horovod/workloads/horovod_tune_test.py", line 177, in <module>
    assert not result.error
AssertionError
Versions / Dependencies
release branch
Reproduction script
NA
Issue Severity
No response
Issue Analytics
- State:
- Created a year ago
- Comments:7 (7 by maintainers)
Not a release blocker. Several things are going wrong with this particular test. The PR above is intended to fix the test and bring down its cost; it is pending review right now. Cost-wise, we are down from $60 to $10 per run.
Debugging this test also exposed that a Torch Dataloader with multiprocessing (num_workers) may not get a chance to clean up the processes it started, because Ray Core kills actors forcefully when GC kicks in. As a result, GPU memory leaks. This is NOT a regression. As a next step, the ML team should work with the Core team to make sure that the Ray AIR trainer works well with a Torch Dataloader that uses multiprocessing. cc @rkooo567
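To illustrate the cleanup problem described above, here is a minimal sketch that uses only the Python standard library. It is not Torch or Ray code: the sleeping child processes are a stand-in for Dataloader workers, and the names managed_workers and run_once are hypothetical. It shows that worker teardown only happens when the owning process gets to run its cleanup code, which a forceful kill of that process would skip.

```python
import contextlib
import subprocess
import sys


@contextlib.contextmanager
def managed_workers(count):
    """Start `count` child processes and reap them when the block exits."""
    # Each child just sleeps; it stands in for a data-loading worker.
    workers = [
        subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
        for _ in range(count)
    ]
    try:
        yield workers
    finally:
        # Runs on normal exit and on exceptions, but NOT if this parent
        # process is itself killed forcefully (for example with SIGKILL).
        for proc in workers:
            proc.terminate()
        for proc in workers:
            proc.wait(timeout=10)


def run_once(count=2):
    """Return worker liveness inside the block and after it has exited."""
    with managed_workers(count) as workers:
        alive_inside = [proc.poll() is None for proc in workers]
    alive_after = [proc.poll() is None for proc in workers]
    return alive_inside, alive_after
```

If the parent is killed before the finally block runs, the children are never terminated by it. That is the same shape of failure as reported here, where the leftover worker processes keep GPU memory allocated.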
Started this to track.
For curious minds, all the investigation details are summarized here: https://docs.google.com/document/d/1LNCqYuhZkMqrKRuuS6nkSSMq_nTWhgMkP0_7323DYbQ/edit#
Declared non-release-blocking.