[dashboard] "new_dashboard" is leaking processes on Linux


What is the problem?

Ray version and other system information (Python version, TensorFlow version, OS): Ray 1.1dev

OS: Ubuntu 18.04, Python 3.6

When Ray is started with ray.init() and then stopped, I see a leftover process like this:

swang    30660  4805  1 16:55 pts/1    00:00:02 /home/swang/anaconda3/envs/ray-36/bin/python -u /home/swang/ray/python/ray/new_dashboard/agent.py ...
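To check for leftover agents after a run, something like the following can help. It is only a sketch: the helper names find_agent_pids and leftover_agent_pids are hypothetical, and it assumes the `ps -ef` column layout shown above (PID in the second column) and that the agent's command line contains "new_dashboard/agent.py".

```python
import subprocess

# Path fragment taken from the leftover process line in this report (assumption:
# the agent's command line always contains it).
AGENT_MARKER = "new_dashboard/agent.py"


def find_agent_pids(ps_lines, marker=AGENT_MARKER):
    """Return the PIDs of `ps -ef` lines whose command mentions `marker`."""
    pids = []
    for line in ps_lines:
        if marker not in line:
            continue
        fields = line.split()
        # In `ps -ef` output the second column is the PID.
        if len(fields) > 1 and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


def leftover_agent_pids():
    """Run `ps -ef` and return the PIDs of any dashboard agent processes."""
    result = subprocess.run(
        ["ps", "-ef"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return find_agent_pids(result.stdout.splitlines())
```

Calling leftover_agent_pids() after the driver script has exited should return an empty list if no agent was leaked.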

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

import ray

ray.init()

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

1 reaction
stephanie-wang commented, Nov 20, 2020

Oh I think you’re right about that. It seems it’s still leaking processes. Here is the output that I have from a dashboard agent (it repeats this over and over):

2020-11-20 11:21:07,215	INFO agent.py:69 -- Dashboard agent grpc address: XXX.XXX.XXX.XXX:63209
2020-11-20 11:21:07,221	INFO utils.py:201 -- Get all modules by type: DashboardAgentModule
2020-11-20 11:21:07,889	INFO agent.py:82 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.log.log_agent.LogAgent'>
2020-11-20 11:21:07,889	INFO agent.py:82 -- Loading DashboardAgentModule: <class 'ray.new_dashboard.modules.reporter.reporter_agent.ReporterAgent'>
2020-11-20 11:21:07,892	INFO agent.py:86 -- Loaded 2 modules.
2020-11-20 11:21:07,893	INFO agent.py:150 -- Dashboard agent http address: XXX.XXX.XXX.XXX:42441
2020-11-20 11:21:07,894	INFO agent.py:157 -- <ResourceRoute [GET] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')> -> <bound method StaticResource._handle of <StaticResource  /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')>>
2020-11-20 11:21:07,894	INFO agent.py:157 -- <ResourceRoute [OPTIONS] <StaticResource  /logs -> PosixPath('/tmp/ray/session_2020-11-20_11-21-05_809084_19322/logs')> -> <bound method _PreflightHandler._preflight_handler of <aiohttp_cors.cors_config._CorsConfigImpl object at 0x7f4226661da0>>
2020-11-20 11:21:07,894	INFO agent.py:158 -- Registered 2 routes.
2020-11-20 11:21:10,437	ERROR reporter_agent.py:234 -- Error publishing node physical stats.
Traceback (most recent call last):
  File "/home/swang/ray/python/ray/new_dashboard/modules/reporter/reporter_agent.py", line 232, in _perform_iteration
    await aioredis_client.publish(self._key, jsonify_asdict(stats))
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 257, in _wait_execute
    conn = await self.acquire(command, args)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 324, in acquire
    await self._fill_free(override_min=True)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/pool.py", line 383, in _fill_free
    conn = await self._create_new_connection(self._address)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/connection.py", line 113, in create_connection
    timeout)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/tasks.py", line 339, in wait_for
    return (yield from fut)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/site-packages/aioredis/stream.py", line 24, in open_connection
    lambda: protocol, host, port, **kwds)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/base_events.py", line 798, in create_connection
    raise exceptions[0]
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/base_events.py", line 785, in create_connection
    yield from self.sock_connect(sock, address)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/selector_events.py", line 439, in sock_connect
    return (yield from fut)
  File "/home/swang/anaconda3/envs/ray-36/lib/python3.6/asyncio/selector_events.py", line 469, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
ConnectionRefusedError: [Errno 111] Connect call failed ('XXX.XXX.XXX.XXX', 6379)

1 reaction
fyrestone commented, Nov 20, 2020

Max is out until next week. @fyrestone @mxz96102 could you please take a look? This might be Linux-only because I’m not able to reproduce it locally.

The dashboard agent has a loop that checks whether its parent is still alive: https://github.com/ray-project/ray/blob/master/dashboard/agent.py#L90. Are there any logs from the leaked dashboard agent?
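For readers unfamiliar with this pattern, a parent-liveness loop can be sketched roughly as below. This is not the code at the linked line of agent.py; the function name wait_for_parent_exit, the injectable get_ppid parameter, and the polling interval are assumptions made for illustration. It relies on the Linux behavior that an orphaned child is reparented, so os.getppid() stops returning the original parent's PID.

```python
import asyncio
import os


async def wait_for_parent_exit(initial_ppid, get_ppid=os.getppid, interval_s=0.5):
    """Poll the parent PID and return the new value once it changes.

    `initial_ppid` is the parent PID recorded at startup. `get_ppid` is
    injectable (a hypothetical convenience) so the loop can be tested
    without killing real processes.
    """
    while True:
        current = get_ppid()
        if current != initial_ppid:
            # The original parent is gone and this process was reparented.
            return current
        await asyncio.sleep(interval_s)
```

A process using this would record os.getppid() at startup, run the coroutine alongside its other tasks, and exit once it returns. If such a loop never notices the parent's exit, or the process does not exit afterwards, the child would be left behind as described in this issue.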
