[Core] [Nightly] [Flaky] `many_drivers` test failed
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Ray Core
What happened + What you expected to happen
(run_driver pid=3156456) ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::f() (pid=3167980, ip=172.31.62.177)
--
| (run_driver pid=3156456) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-62-177 is used (29.1 / 30.57 GB). The top 10 memory consumers are:
| (run_driver pid=3156456)
| (run_driver pid=3156456) PID MEM COMMAND
| (run_driver pid=3156456) 755 21.37GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
| (run_driver pid=3156456) 729 0.41GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s
| (run_driver pid=3156456) 800 0.11GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_m
| (run_driver pid=3156456) 51 0.09GiB /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy
| (run_driver pid=3156456) 378 0.09GiB /home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session auth_start
| (run_driver pid=3156456) 691 0.09GiB python workloads/many_drivers.py
| (run_driver pid=3156456) 834 0.09GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
| (run_driver pid=3156456) 890 0.08GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
| (run_driver pid=3156456) 1014 0.08GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
| (run_driver pid=3156456) 953 0.08GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
| Traceback (most recent call last):
| File "workloads/many_drivers.py", line 95, in <module>
| ray.get(ready_id)
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
| return func(*args, **kwargs)
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1742, in get
| raise value.as_instanceof_cause()
| ray.exceptions.RayTaskError(CalledProcessError): ray::run_driver() (pid=3136504, ip=172.31.62.177)
| File "workloads/many_drivers.py", line 80, in run_driver
| output = run_string_as_driver(driver_script)
| subprocess.CalledProcessError: Command '['/home/ray/anaconda3/bin/python', '-']' returned non-zero exit status 1.
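The cluster appears to be running out of memory, and the dashboard looks to be the cause: its process (PID 755) holds 21.37GiB of the 29.1 GB in use on the node.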
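For anyone triaging similar reports, here is a minimal sketch of pulling the largest consumer out of the `PID MEM COMMAND` table. The rows are copied from the log above (commands stay truncated exactly as logged). The helper names `parse_rows` and `top_consumer` are made up for this example and are not part of Ray. The error message mixes "GB" and "GiB", so the share is only a rough figure: about 73% of used memory.

```python
import re

# A few rows copied verbatim from the report above; the log truncates the commands.
REPORT = """\
755 21.37GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
729 0.41GiB /home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s
800 0.11GiB /home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_m
691 0.09GiB python workloads/many_drivers.py
"""

USED_MEMORY = 29.1  # "29.1 / 30.57 GB" from the error message

ROW = re.compile(r"^\s*(\d+)\s+([\d.]+)GiB\s+(.*)$")


def parse_rows(text):
    """Return a list of (pid, mem_gib, command) tuples (hypothetical helper)."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            pid, mem, command = match.groups()
            rows.append((int(pid), float(mem), command))
    return rows


def top_consumer(rows):
    """Return the row with the largest memory footprint (hypothetical helper)."""
    return max(rows, key=lambda row: row[1])


rows = parse_rows(REPORT)
top = top_consumer(rows)
share = top[1] / USED_MEMORY  # approximate: the message mixes GB and GiB
print(f"pid={top[0]} mem={top[1]}GiB share_of_used={share:.0%}")
```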
Versions / Dependencies
master
Reproduction script
Run the weekly tests. Example output: https://buildkite.com/ray-project/periodic-ci/builds/2247#309c72ce-6ceb-4e86-9470-3c429fb1bf81
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Comments:36 (36 by maintainers)
Top GitHub Comments
Lowering the priority to P1. The test failed because master is faster after this PR.
The memory leak in the dashboard still exists. But without this PR, every worker basically has to import a lot of things, which slows the driver down.
We can see this from this log
Each iteration took a long time, and within 24h it finished only 2k iterations.
With this PR, it looks like this:
This is about 8h, and it has run 4k iterations.
Without the PR, the slope of memory consumption flattens over time because the test itself runs slower as time goes on.
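To put rough numbers on the comparison above, here is a back-of-the-envelope sketch. The figures are the approximate ones quoted in this comment (2k iterations in 24h, 4k in about 8h). The assumption that the leak is a fixed amount per iteration is mine, purely for illustration; the thread does not confirm it.

```python
def iterations_per_hour(iterations, hours):
    """Average throughput over the whole run."""
    return iterations / hours


# Approximate figures quoted in the comment above.
rate_without_pr = iterations_per_hour(2000, 24)  # roughly 83 iterations/hour
rate_with_pr = iterations_per_hour(4000, 8)      # 500 iterations/hour

speedup = rate_with_pr / rate_without_pr

# Assumption (not confirmed in the thread): a fixed amount leaks per iteration.
# Under that assumption, memory growth per wall-clock hour scales with
# throughput, so the faster build reaches the OOM threshold sooner.
print(f"without PR: {rate_without_pr:.1f} it/h")
print(f"with PR:    {rate_with_pr:.1f} it/h")
print(f"speedup:    {speedup:.1f}x")
```

Under that assumption, the same leak fills memory about six times faster per hour with the PR, which would explain why the test only started failing after master got faster.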
There are a couple of options here:
I disabled the actor info in the dashboard ad hoc, and the issue is still there. If we trust tracemalloc, then the leak has to be in the C++ layer. I'll profile the memory there.