
[Core] [Nightly] [Flaky] `many_drivers` test failed

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

(run_driver pid=3156456) ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::f() (pid=3167980, ip=172.31.62.177)
(run_driver pid=3156456) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-62-177 is used (29.1 / 30.57 GB). The top 10 memory consumers are:
(run_driver pid=3156456)
(run_driver pid=3156456) PID	MEM	COMMAND
(run_driver pid=3156456) 755	21.37GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
(run_driver pid=3156456) 729	0.41GiB	/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s
(run_driver pid=3156456) 800	0.11GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_m
(run_driver pid=3156456) 51	0.09GiB	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy
(run_driver pid=3156456) 378	0.09GiB	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session auth_start
(run_driver pid=3156456) 691	0.09GiB	python workloads/many_drivers.py
(run_driver pid=3156456) 834	0.09GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 890	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 1014	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 953	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen

Traceback (most recent call last):
  File "workloads/many_drivers.py", line 95, in <module>
    ray.get(ready_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1742, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CalledProcessError): ray::run_driver() (pid=3136504, ip=172.31.62.177)
  File "workloads/many_drivers.py", line 80, in run_driver
    output = run_string_as_driver(driver_script)
subprocess.CalledProcessError: Command '['/home/ray/anaconda3/bin/python', '-']' returned non-zero exit status 1.

It seems the cluster is OOMing and the dashboard is to blame.
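To confirm which process is consuming the memory on the node, something like the following psutil-based snippet can reproduce a "top memory consumers" view. This is an illustrative sketch only, not the memory_monitor code that produced the table above.

# Illustrative sketch: list the top memory consumers on the current node,
# similar in spirit to the table printed by Ray's memory monitor above.
import psutil

procs = []
for p in psutil.process_iter(["pid", "memory_info", "cmdline"]):
    mem = p.info["memory_info"]
    if mem is None:  # process we are not allowed to inspect
        continue
    procs.append((mem.rss, p.info["pid"], " ".join(p.info["cmdline"] or [])))

for rss, pid, cmd in sorted(procs, reverse=True)[:10]:
    print(f"{pid}\t{rss / 1024 ** 3:.2f}GiB\t{cmd[:100]}")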

Versions / Dependencies

master

Reproduction script

Run the weekly tests. Example output: https://buildkite.com/ray-project/periodic-ci/builds/2247#309c72ce-6ceb-4e86-9470-3c429fb1bf81
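For reference, the failing workload is roughly of this shape, inferred from the traceback above. This is a simplified sketch and not the actual workloads/many_drivers.py: each iteration runs a run_driver task, each task launches a fresh Python driver as a subprocess, and a non-zero exit status surfaces as CalledProcessError.

# Simplified sketch of the many_drivers workload, inferred from the
# traceback above -- not the actual workloads/many_drivers.py.
# Assumes a running Ray cluster reachable via address="auto".
import subprocess
import sys

import ray

DRIVER_SCRIPT = """
import ray
ray.init(address="auto")

@ray.remote
def f():
    return 1

assert ray.get(f.remote()) == 1
"""

@ray.remote
def run_driver():
    # Start a fresh Python driver process; a non-zero exit status raises
    # subprocess.CalledProcessError, as seen in the failure log above.
    subprocess.run([sys.executable, "-"], input=DRIVER_SCRIPT, text=True, check=True)

ray.init(address="auto")
for iteration in range(10):
    ready_id = run_driver.remote()
    ray.get(ready_id)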

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 36 (36 by maintainers)

Top GitHub Comments

3 reactions
iycheng commented, Jan 26, 2022

Lowering the priority to P1. The reason it failed is that master is faster after this PR.

The memory leak still exists in the dashboard. But without this PR, every worker basically needs to import a lot of things, which slows down the driver's performance.

We can see this from the log:

Iteration 2221:
  - Iteration time: 91.22895789146423.
  - Absolute time: 1639977131.3817475.
  - Total elapsed time: 82505.75467967987.
Iteration 2222:
  - Iteration time: 89.64419484138489.
  - Absolute time: 1639977221.0259423.
  - Total elapsed time: 82595.39887452126.
Iteration 2223:
  - Iteration time: 70.22041773796082.
  - Absolute time: 1639977291.24636.
  - Total elapsed time: 82665.61929225922.
Iteration 2224:
  - Iteration time: 141.53751230239868.
  - Absolute time: 1639977432.7838724.
  - Total elapsed time: 82807.15680456161.

Each iteration takes a long time to finish, and within 24h it only completed about 2k iterations.

With this PR, it looks like this:

Iteration 4001:
  - Iteration time: 0.95485520362854.
  - Absolute time: 1643077687.9250314.
  - Total elapsed time: 26973.94828104973.
Iteration 4002:
  - Iteration time: 20.370080947875977.
  - Absolute time: 1643077708.2951124.
  - Total elapsed time: 26994.318361997604.
Iteration 4003:
  - Iteration time: 4.543852806091309.
  - Absolute time: 1643077712.8389652.
  - Total elapsed time: 26998.862214803696.
Iteration 4004:
  - Iteration time: 1.0939826965332031.
  - Absolute time: 1643077713.9329479.
  - Total elapsed time: 26999.95619750023.

This is after only about 8h, and it has already run 4k iterations.

The reason the slope of memory consumption flattens over time without the PR is that the test itself runs slower and slower as time goes on.

There are a couple of options here:

  • remove the expensive fields in actors (I will probably do this)
  • limit the number of actors tracked (we'll lose older actor history over time; see the sketch after this list)
  • store the info in a disk-based DB (long-term plan, maybe)
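As a rough illustration of the second option, the dashboard could keep only the most recently updated actors in a bounded map and evict the rest, at the cost of losing older actor history. A minimal sketch, where ActorTable and MAX_TRACKED_ACTORS are hypothetical names and not Ray dashboard APIs:

# Hypothetical sketch of option 2: cap how many actor entries the
# dashboard keeps in memory, evicting the least recently updated ones.
from collections import OrderedDict

MAX_TRACKED_ACTORS = 10_000  # hypothetical cap, not a real Ray setting

class ActorTable:
    def __init__(self, max_size=MAX_TRACKED_ACTORS):
        self._max_size = max_size
        self._actors = OrderedDict()  # actor_id -> actor info dict

    def update(self, actor_id, info):
        self._actors[actor_id] = info
        self._actors.move_to_end(actor_id)  # mark as most recently updated
        while len(self._actors) > self._max_size:
            self._actors.popitem(last=False)  # drop the oldest entry

    def get(self, actor_id):
        return self._actors.get(actor_id)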
1 reaction
iycheng commented, Jan 10, 2022

I disabled the actor info in the dashboard ad hoc, and the issue is still there. If we trust tracemalloc, then the leak has to be in the C++ layer. I'll profile the memory there.
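A common way to tell a Python-side leak from a native (C++) one is to compare what tracemalloc sees against the process RSS: tracemalloc only tracks Python-level allocations, so an RSS that keeps growing while the traced total stays flat points at the native layer. A minimal sketch of that comparison, illustrative and not the profiling actually used here:

# Illustrative: compare Python-level allocations (tracemalloc) against
# process RSS. If RSS grows while traced memory stays flat, the leak is
# likely in native (C++) code, which tracemalloc cannot see.
import os
import time
import tracemalloc

import psutil

tracemalloc.start()
proc = psutil.Process(os.getpid())

for i in range(10):
    # ... exercise the suspected code path here ...
    traced, _peak = tracemalloc.get_traced_memory()
    rss = proc.memory_info().rss
    print(f"iter {i}: traced={traced / 1024 ** 2:.1f}MiB rss={rss / 1024 ** 2:.1f}MiB")
    time.sleep(1)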
