
[Core] [Nightly] [Flaky] `many_drivers` test failed

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

(run_driver pid=3156456) ray.exceptions.RayTaskError(RayOutOfMemoryError): ray::f() (pid=3167980, ip=172.31.62.177)
(run_driver pid=3156456) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ip-172-31-62-177 is used (29.1 / 30.57 GB). The top 10 memory consumers are:
(run_driver pid=3156456)
(run_driver pid=3156456) PID	MEM	COMMAND
(run_driver pid=3156456) 755	21.37GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/dash
(run_driver pid=3156456) 729	0.41GiB	/home/ray/anaconda3/lib/python3.7/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/s
(run_driver pid=3156456) 800	0.11GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/log_m
(run_driver pid=3156456) 51	0.09GiB	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session web_terminal_server --deploy
(run_driver pid=3156456) 378	0.09GiB	/home/ray/anaconda3/bin/python /home/ray/anaconda3/bin/anyscale session auth_start
(run_driver pid=3156456) 691	0.09GiB	python workloads/many_drivers.py
(run_driver pid=3156456) 834	0.09GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 890	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 1014	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen
(run_driver pid=3156456) 953	0.08GiB	/home/ray/anaconda3/bin/python -u /home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/agen

Traceback (most recent call last):
  File "workloads/many_drivers.py", line 95, in <module>
    ray.get(ready_id)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1742, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(CalledProcessError): ray::run_driver() (pid=3136504, ip=172.31.62.177)
  File "workloads/many_drivers.py", line 80, in run_driver
    output = run_string_as_driver(driver_script)
subprocess.CalledProcessError: Command '['/home/ray/anaconda3/bin/python', '-']' returned non-zero exit status 1.

It seems the cluster is OOMing and the dashboard is to blame.
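To confirm which process is consuming the memory on the node, something like the following psutil-based snippet can reproduce a "top memory consumers" view. This is an illustrative sketch only, not the memory_monitor code that produced the table above.

# Illustrative sketch: list the top memory consumers on the current node,
# similar in spirit to the table printed by Ray's memory monitor above.
import psutil

procs = []
for p in psutil.process_iter(["pid", "memory_info", "cmdline"]):
    mem = p.info["memory_info"]
    if mem is None:  # process we are not allowed to inspect
        continue
    procs.append((mem.rss, p.info["pid"], " ".join(p.info["cmdline"] or [])))

for rss, pid, cmd in sorted(procs, reverse=True)[:10]:
    print(f"{pid}\t{rss / 1024 ** 3:.2f}GiB\t{cmd[:100]}")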

Versions / Dependencies

master

Reproduction script

Run the weekly tests. Example output: https://buildkite.com/ray-project/periodic-ci/builds/2247#309c72ce-6ceb-4e86-9470-3c429fb1bf81
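For reference, the failing workload is roughly of this shape, inferred from the traceback above. This is a simplified sketch and not the actual workloads/many_drivers.py: each iteration runs a run_driver task, each task launches a fresh Python driver as a subprocess, and a non-zero exit status surfaces as CalledProcessError.

# Simplified sketch of the many_drivers workload, inferred from the
# traceback above -- not the actual workloads/many_drivers.py.
# Assumes a running Ray cluster reachable via address="auto".
import subprocess
import sys

import ray

DRIVER_SCRIPT = """
import ray
ray.init(address="auto")

@ray.remote
def f():
    return 1

assert ray.get(f.remote()) == 1
"""

@ray.remote
def run_driver():
    # Start a fresh Python driver process; a non-zero exit status raises
    # subprocess.CalledProcessError, as seen in the failure log above.
    subprocess.run([sys.executable, "-"], input=DRIVER_SCRIPT, text=True, check=True)

ray.init(address="auto")
for iteration in range(10):
    ready_id = run_driver.remote()
    ray.get(ready_id)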

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 36 (36 by maintainers)

Top GitHub Comments

3 reactions
iycheng commented, Jan 26, 2022

Lowering the priority to P1. The reason it failed is that master is faster after this PR.

The memory leak still exists in the dashboard. But without this PR, every worker basically needs to import a lot of things, which slows down the driver's performance.

We can see this from the log:

Iteration 2221:
  - Iteration time: 91.22895789146423.
  - Absolute time: 1639977131.3817475.
  - Total elapsed time: 82505.75467967987.
Iteration 2222:
  - Iteration time: 89.64419484138489.
  - Absolute time: 1639977221.0259423.
  - Total elapsed time: 82595.39887452126.
Iteration 2223:
  - Iteration time: 70.22041773796082.
  - Absolute time: 1639977291.24636.
  - Total elapsed time: 82665.61929225922.
Iteration 2224:
  - Iteration time: 141.53751230239868.
  - Absolute time: 1639977432.7838724.
  - Total elapsed time: 82807.15680456161.

Each iteration takes a long time to finish, and within 24h it only completed about 2k iterations.

With this PR, it looks like this:

Iteration 4001:
  - Iteration time: 0.95485520362854.
  - Absolute time: 1643077687.9250314.
  - Total elapsed time: 26973.94828104973.
Iteration 4002:
  - Iteration time: 20.370080947875977.
  - Absolute time: 1643077708.2951124.
  - Total elapsed time: 26994.318361997604.
Iteration 4003:
  - Iteration time: 4.543852806091309.
  - Absolute time: 1643077712.8389652.
  - Total elapsed time: 26998.862214803696.
Iteration 4004:
  - Iteration time: 1.0939826965332031.
  - Absolute time: 1643077713.9329479.
  - Total elapsed time: 26999.95619750023.

This is after only about 8h, and it has already run 4k iterations.

The reason the slope of memory consumption flattens over time without the PR is that the test itself runs slower and slower as time goes on.

There are a couple of options here:

  • remove the expensive fields in actors (I will probably do this)
  • limit the number of actors tracked (we'll lose older actor history over time; see the sketch after this list)
  • store the info in a disk-based DB (long-term plan, maybe)
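As a rough illustration of the second option, the dashboard could keep only the most recently updated actors in a bounded map and evict the rest, at the cost of losing older actor history. A minimal sketch, where ActorTable and MAX_TRACKED_ACTORS are hypothetical names and not Ray dashboard APIs:

# Hypothetical sketch of option 2: cap how many actor entries the
# dashboard keeps in memory, evicting the least recently updated ones.
from collections import OrderedDict

MAX_TRACKED_ACTORS = 10_000  # hypothetical cap, not a real Ray setting

class ActorTable:
    def __init__(self, max_size=MAX_TRACKED_ACTORS):
        self._max_size = max_size
        self._actors = OrderedDict()  # actor_id -> actor info dict

    def update(self, actor_id, info):
        self._actors[actor_id] = info
        self._actors.move_to_end(actor_id)  # mark as most recently updated
        while len(self._actors) > self._max_size:
            self._actors.popitem(last=False)  # drop the oldest entry

    def get(self, actor_id):
        return self._actors.get(actor_id)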
1 reaction
iycheng commented, Jan 10, 2022

I disabled the actor info in the dashboard ad hoc, and the issue is still there. If we trust tracemalloc, then the leak has to be in the C++ layer. I'll profile the memory there.
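A common way to tell a Python-side leak from a native (C++) one is to compare what tracemalloc sees against the process RSS: tracemalloc only tracks Python-level allocations, so an RSS that keeps growing while the traced total stays flat points at the native layer. A minimal sketch of that comparison, illustrative and not the profiling actually used here:

# Illustrative: compare Python-level allocations (tracemalloc) against
# process RSS. If RSS grows while traced memory stays flat, the leak is
# likely in native (C++) code, which tracemalloc cannot see.
import os
import time
import tracemalloc

import psutil

tracemalloc.start()
proc = psutil.Process(os.getpid())

for i in range(10):
    # ... exercise the suspected code path here ...
    traced, _peak = tracemalloc.get_traced_memory()
    rss = proc.memory_info().rss
    print(f"iter {i}: traced={traced / 1024 ** 2:.1f}MiB rss={rss / 1024 ** 2:.1f}MiB")
    time.sleep(1)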
