[CI][release] long_running_horovod_tune_test failed

What happened + What you expected to happen

Build failure

Cluster

----------------------------------+
2022-10-12 18:03:13,558 ERROR tune.py:773 -- Trials did not complete: [HorovodTrainer_566a7_00000, HorovodTrainer_566a7_00001, HorovodTrainer_566a7_00002, HorovodTrainer_566a7_00003, HorovodTrainer_566a7_00004, HorovodTrainer_566a7_00005, HorovodTrainer_566a7_00006, HorovodTrainer_566a7_00007]
2022-10-12 18:03:13,558 INFO tune.py:778 -- Total run time: 19102.05 seconds (19101.48 seconds for the tuning loop).

Traceback (most recent call last):
  File "horovod/workloads/horovod_tune_test.py", line 177, in <module>
    assert not result.error
AssertionError


Versions / Dependencies

release branch

Reproduction script

NA

Issue Severity

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
xwjiang2010 commented, Oct 21, 2022

Not a release blocker. Several things are not going right with this particular test. The PR above is intended to fix the test and bring down its cost; it is pending review right now. Cost-wise, we are down from $60 to $10 per run.

Debugging this test also exposed that a Torch DataLoader with multiprocessing (num_workers) may not get a chance to clean up the processes it started, because Ray Core forcefully kills actors when GC kicks in. As a result, GPU memory leaks. This is NOT a regression.
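To illustrate the mechanism described above, here is a minimal, POSIX-only sketch using only the Python standard library. It is not Ray or Torch code: the "parent" script is a hypothetical stand-in for an actor, and the sleeping child is a hypothetical stand-in for a DataLoader worker. It only shows the general point that a process terminated gracefully (SIGTERM) can run its cleanup, while one killed forcefully (SIGKILL) cannot, so its child is left running.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for an actor that owns one data-loading worker process.
PARENT_SCRIPT = r"""
import signal, subprocess, sys

# Stand-in for a DataLoader worker: a child process that just sleeps.
worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def cleanup(signum, frame):
    # Graceful path: stop the worker and reap it before exiting.
    worker.terminate()
    worker.wait()
    sys.exit(0)

signal.signal(signal.SIGTERM, cleanup)
print(worker.pid, flush=True)  # tell the caller which PID to watch
signal.pause()                 # block until a signal arrives
"""


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def worker_survives(sig: int) -> bool:
    """Stop the parent with `sig` and report whether its worker outlives it."""
    parent = subprocess.Popen(
        [sys.executable, "-c", PARENT_SCRIPT],
        stdout=subprocess.PIPE,
        text=True,
    )
    worker_pid = int(parent.stdout.readline())
    parent.send_signal(sig)
    parent.wait(timeout=10)
    parent.stdout.close()
    alive = pid_alive(worker_pid)
    if alive:
        os.kill(worker_pid, signal.SIGKILL)  # tidy up the orphaned worker
    return alive


if __name__ == "__main__":
    print("worker survives graceful stop:", worker_survives(signal.SIGTERM))
    print("worker survives forceful kill:", worker_survives(signal.SIGKILL))
```

In this sketch the SIGTERM handler plays the role of the cleanup that a multiprocessing data loader would normally perform on shutdown. SIGKILL cannot be caught, so no handler runs and the worker keeps whatever resources it holds until something else stops it.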

As a next step, the ML team should work with the Core team to make sure the Ray AIR trainer works well with Torch DataLoader when it uses multiprocessing. cc @rkooo567

Started this to track.

For curious minds, all the investigation details are summarized here: https://docs.google.com/document/d/1LNCqYuhZkMqrKRuuS6nkSSMq_nTWhgMkP0_7323DYbQ/edit#

1 reaction
rickyyx commented, Oct 20, 2022

Declared non-release-blocking.
