[Core] [Bug] ray memory command produces inconsistent results
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Monitoring & Debugging, Dashboard
What happened + What you expected to happen
We use the "ray memory" command to check the status of Ray's object store, and the results it produces are confusing.

The testbed has two nodes, 10.128.115.12 and 10.128.91.47. An object is put into the object store on one node, and then the command is run to check the status. The per-node summaries do not match the aggregate:
(1) When the command runs on the .47 node, the summary for the .12 node is 264 B and the summary for the .47 node is 1600 MB, but the aggregate is 762 MiB.
(2) When the command runs on the .12 node, the summary for the .12 node is 800 MB and the summary for the .47 node is 800 MB, but the aggregate is 762 MiB.
The summaries and the aggregate seem to use different metrics in terms of what is and is not counted. At a minimum, the numbers should be consistent.
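A note on the numbers, for reference: the object in the reproduction script below is np.ones(100000000), i.e. 10^8 float64 values, so a single copy occupies 800,000,000 B. That is about 762.9 MiB, so the "762 MiB" aggregate and an "800 MB" summary plausibly describe the same single object in different units; the 1600 MB summary in case (1), however, looks like that same object being counted twice (the log below shows two 800000255 B ACTOR_HANDLE entries on the .47 node). A quick sanity check in Python:

    # Size of the object created by the reproduction script below.
    num_elements = 100000000          # np.ones(100000000) -> float64 array
    size_bytes = num_elements * 8     # 800,000,000 B
    print(size_bytes / 10**6)         # 800.0    -> the per-node summary (MB)
    print(size_bytes / 2**20)         # ~762.94  -> the aggregate line (MiB)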
The full logs follow:
(check_memory pid=3591574, ip=10.128.91.47) ======== Object references status: 2022-01-05 08:40:04.983423 ========
(check_memory pid=3591574, ip=10.128.91.47) Grouping by node address… Sorting by object size… Display all entries per group…
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) — Summary for node address: 10.128.115.12 —
(check_memory pid=3591574, ip=10.128.91.47) Mem Used by Objects Local References Pinned Pending Tasks Captured in Objects Actor Handles
(check_memory pid=3591574, ip=10.128.91.47) 264.0 B 2, (263.0 B) 0, (0.0 B) 0, (0.0 B) 0, (0.0 B) 2, (-2.0 B)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) — Object references for node address: 10.128.115.12 —
(check_memory pid=3591574, ip=10.128.91.47) IP Address | PID | Type | Call Site | Size | Reference Type | Object Ref
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.115.12 | 447254 | Driver | | ? | LOCAL_REFERENCE | ee4e90da584ab0ebffffffffffffffffffffffff0100000001000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.115.12 | 447254 | Driver | <unknown> | ? | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000002000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.115.12 | 447254 | Driver | | ? | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000001000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.115.12 | 447254 | Driver | | 264.0 B | LOCAL_REFERENCE | 69a6825d641b46135ba0f22968968612a61467950100000001000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) — Summary for node address: 10.128.91.47 —
(check_memory pid=3591574, ip=10.128.91.47) Mem Used by Objects Local References Pinned Pending Tasks Captured in Objects Actor Handles
(check_memory pid=3591574, ip=10.128.91.47) 1600000510.0 B 0, (0.0 B) 0, (0.0 B) 0, (0.0 B) 0, (0.0 B) 3, (1600000509.0 B)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) — Object references for node address: 10.128.91.47 —
(check_memory pid=3591574, ip=10.128.91.47) IP Address | PID | Type | Call Site | Size | Reference Type | Object Ref
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.91.47 | 3591538 | Worker | | ? | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000001000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.91.47 | 3591538 | Worker | | 800000255.0 B | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000002000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) 10.128.91.47 | 3591574 | Worker | (deserialize task arg) memory.check_memory | 800000255.0 B | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000002000000
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=3591574, ip=10.128.91.47) — Aggregate object store stats across all nodes —
(check_memory pid=3591574, ip=10.128.91.47) Plasma memory usage 762 MiB, 1 objects, 0.2% full, 0.2% needed
(check_memory pid=3591574, ip=10.128.91.47)
(check_memory pid=447289) ======== Object references status: 2022-01-05 08:40:05.533676 ========
(check_memory pid=447289) Grouping by node address… Sorting by object size… Display all entries per group…
(check_memory pid=447289)
(check_memory pid=447289)
(check_memory pid=447289) — Summary for node address: 10.128.115.12 —
(check_memory pid=447289) Mem Used by Objects Local References Pinned Pending Tasks Captured in Objects Actor Handles
(check_memory pid=447289) 800000519.0 B 2, (263.0 B) 0, (0.0 B) 0, (0.0 B) 0, (0.0 B) 3, (800000253.0 B)
(check_memory pid=447289)
(check_memory pid=447289) — Object references for node address: 10.128.115.12 —
(check_memory pid=447289) IP Address | PID | Type | Call Site | Size | Reference Type | Object Ref
(check_memory pid=447289)
(check_memory pid=447289) 10.128.115.12 | 447254 | Driver | | ? | LOCAL_REFERENCE | 4ee449587774c1f0ffffffffffffffffffffffff0100000001000000
(check_memory pid=447289)
(check_memory pid=447289)
(check_memory pid=447289) 10.128.115.12 | 447254 | Driver | <unknown> | ? | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000002000000
(check_memory pid=447289)
(check_memory pid=447289)
(check_memory pid=447289) 10.128.115.12 | 447254 | Driver | | ? | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000001000000
(check_memory pid=447289)
(check_memory pid=447289)
(check_memory pid=447289) 10.128.115.12 | 447254 | Driver | | 264.0 B | LOCAL_REFERENCE | 69a6825d641b46135ba0f22968968612a61467950100000001000000
(check_memory pid=447289)
(check_memory pid=447289)
(check_memory pid=447289) 10.128.115.12 | 447289 | Worker | (deserialize task arg) memory.check_memory | 800000255.0 B | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000002000000
(check_memory pid=447289)
(check_memory pid=447289) — Summary for node address: 10.128.91.47 —
(check_memory pid=447289) Mem Used by Objects Local References Pinned Pending Tasks Captured in Objects Actor Handles
(check_memory pid=447289) 800000255.0 B 0, (0.0 B) 0, (0.0 B) 0, (0.0 B) 0, (0.0 B) 2, (800000254.0 B)
(check_memory pid=447289)
(check_memory pid=447289) — Object references for node address: 10.128.91.47 —
(check_memory pid=447289) IP Address | PID | Type | Call Site | Size | Reference Type | Object Ref
(check_memory pid=447289)
(check_memory pid=447289) 10.128.91.47 | 3591538 | Worker | | ? | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000001000000
(check_memory pid=447289)
(check_memory pid=447289)
(check_memory pid=447289) 10.128.91.47 | 3591538 | Worker | | 800000255.0 B | ACTOR_HANDLE | ffffffffffffffff5ba0f22968968612a61467950100000002000000
(check_memory pid=447289)
(check_memory pid=447289) To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1
(check_memory pid=447289)
(check_memory pid=447289) — Aggregate object store stats across all nodes —
(check_memory pid=447289) Plasma memory usage 762 MiB, 1 objects, 0.2% full, 0.2% needed
(check_memory pid=447289)
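Incidentally, several Call Site columns above are blank because call-site recording is off by default; the logs themselves point at RAY_record_ref_creation_sites=1. A minimal sketch of enabling it, assuming a fresh single-node session (on a pre-started cluster the variable has to be set in the environment of each node's Ray processes, e.g. before "ray start"):

    import os

    # Must be set before Ray starts: Ray reads it at process startup, so
    # exporting it into an already-running cluster has no effect.
    os.environ["RAY_record_ref_creation_sites"] = "1"

    import ray
    ray.init()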
Versions / Dependencies
Ray = 1.9.0
Reproduction script
import ray
import numpy as np
import os

total_num = 100000000  # np.ones(total_num) -> ~800 MB of float64s
actor_name = "global_actor"
pg_name = "pg_name"


@ray.remote
class DataActor:
    def __init__(self):
        # Put one large array into the object store on this actor's node.
        self.data = ray.put(np.ones(total_num))

    def get_data(self):
        return self.data


@ray.remote
def check_memory(data):
    os.system("ray memory")
    return 1


ray.init(address="auto")

# Define a placement group to distribute tasks across the cluster's nodes.
num_nodes = len(ray.nodes())
bundles = [{"CPU": 1}] * num_nodes
placement_group = ray.util.placement_group(
    name=pg_name,
    strategy="SPREAD",
    bundles=bundles,
)
ray.get(placement_group.ready())

# The actor, and the object it puts, lives on the node hosting bundle 0.
my_actor = DataActor.options(
    name=actor_name,
    placement_group=placement_group,
    placement_group_bundle_index=0,
).remote()
data = my_actor.get_data.remote()

# Run "ray memory" once on the actor's node ...
ref = check_memory.options(
    placement_group=placement_group,
    placement_group_bundle_index=0,
).remote(data)
result = ray.get(ref)

# ... and once on the other node.
ref = check_memory.options(
    placement_group=placement_group,
    placement_group_bundle_index=num_nodes - 1,
).remote(data)
result = ray.get(ref)
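For what it's worth, the same report can also be produced in-process instead of shelling out to the CLI. A sketch, assuming Ray 1.9's internal helper behind "ray memory" (not a public, stable API; it may move between releases):

    # Hedged sketch: internal helper used by the "ray memory" CLI in Ray 1.x.
    from ray.internal.internal_api import memory_summary

    # Connects to the running cluster (like the CLI with the default address)
    # and returns the report as a string.
    print(memory_summary())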
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- Created 2 years ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
@clarkzinzow No. In the raylet logs, there is just one warning:

/opt/conda/envs/py37/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py:38: UserWarning: `gpustat` package is not installed. GPU monitoring is not available. To have full functionality of the dashboard please install `pip install ray[default]`.
  warnings.warn("`gpustat` package is not installed. GPU monitoring is "

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.
Please feel free to reopen or open a new issue if you'd still like it to be addressed.
Again, you can always ask for help on our discussion forum or Ray's public slack channel.
Thanks again for opening the issue!