
[Core] [Bug] ray memory command produces inconsistent results


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Monitoring & Debugging, Dashboard

What happened + What you expected to happen

We use the "ray memory" command to check the status of Ray's object store, and the results it produces are confusing.

The testbed has two nodes. An object is put into the object store on one node, and the command is then run on each node to check the status. On both nodes, the per-node summaries do not add up to the reported aggregate: (1) When the command runs on the .47 node (10.128.91.47), the summary for the .12 node (10.128.115.12) is 264 B and the summary for the .47 node is ~1600 MB, yet the aggregate is 762 MiB. (2) When the command runs on the .12 node, the summary for the .12 node is ~800 MB and the summary for the .47 node is ~800 MB, yet the aggregate is again 762 MiB.

The per-node summaries and the aggregate appear to use different metrics, i.e., they differ in what is and is not counted. At the very least, the numbers should be consistent.
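Some quick arithmetic over the sizes that appear in the logs below helps frame the inconsistency (plain Python; the reading that the per-node summaries add up bytes once per reference, while the aggregate counts each plasma object once, is our interpretation, not confirmed Ray behavior):

import numpy as np

# The reproduction script puts np.ones(100000000) into the object store.
array_bytes = 100000000 * np.dtype(np.float64).itemsize
print(array_bytes)          # 800000000 raw bytes
print(array_bytes / 2**20)  # ~762.94, matching the 762 MiB aggregate

# Sizes copied from the "ray memory" output below:
per_ref = 800000255         # bytes attributed to each ACTOR_HANDLE reference
print(2 * per_ref)          # 1600000510: the .47 node's summary in the first run
print(264 + per_ref)        # 800000519: the .12 node's summary in the second run

Both runs agree on the aggregate (one object, 762 MiB), but the summaries appear to count the same object once per reference, which is where the ~800 MB and ~1600 MB figures come from.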

The logs from both check_memory tasks follow. In the original output, every line carries a "(check_memory pid=..., ip=...)" prefix; the prefixes are omitted below and each run is labeled instead.

Output of the first run (check_memory pid=3591574, ip=10.128.91.47):

======== Object references status: 2022-01-05 08:40:04.983423 ========
Grouping by node address...   Sorting by object size...   Display all entries per group...

--- Summary for node address: 10.128.115.12 ---
Mem Used by Objects   Local References   Pinned       Pending Tasks   Captured in Objects   Actor Handles
264.0 B               2, (263.0 B)       0, (0.0 B)   0, (0.0 B)      0, (0.0 B)            2, (-2.0 B)

--- Object references for node address: 10.128.115.12 ---
IP Address      | PID    | Type   | Call Site | Size    | Reference Type  | Object Ref
10.128.115.12   | 447254 | Driver |           | ?       | LOCAL_REFERENCE | ee4e90da584ab0ebffffffffffffffffffffffff0100000001000000
10.128.115.12   | 447254 | Driver | <unknown> | ?       | ACTOR_HANDLE    | ffffffffffffffff5ba0f22968968612a61467950100000002000000
10.128.115.12   | 447254 | Driver |           | ?       | ACTOR_HANDLE    | ffffffffffffffff5ba0f22968968612a61467950100000001000000
10.128.115.12   | 447254 | Driver |           | 264.0 B | LOCAL_REFERENCE | 69a6825d641b46135ba0f22968968612a61467950100000001000000

--- Summary for node address: 10.128.91.47 ---
Mem Used by Objects   Local References   Pinned       Pending Tasks   Captured in Objects   Actor Handles
1600000510.0 B        0, (0.0 B)         0, (0.0 B)   0, (0.0 B)      0, (0.0 B)            3, (1600000509.0 B)

--- Object references for node address: 10.128.91.47 ---
IP Address     | PID     | Type   | Call Site                                   | Size          | Reference Type | Object Ref
10.128.91.47   | 3591538 | Worker |                                             | ?             | ACTOR_HANDLE   | ffffffffffffffff5ba0f22968968612a61467950100000001000000
10.128.91.47   | 3591538 | Worker |                                             | 800000255.0 B | ACTOR_HANDLE   | ffffffffffffffff5ba0f22968968612a61467950100000002000000
10.128.91.47   | 3591574 | Worker | (deserialize task arg) memory.check_memory | 800000255.0 B | ACTOR_HANDLE   | ffffffffffffffff5ba0f22968968612a61467950100000002000000

To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 762 MiB, 1 objects, 0.2% full, 0.2% needed

Output of the second run (check_memory pid=447289, on 10.128.115.12):

======== Object references status: 2022-01-05 08:40:05.533676 ========
Grouping by node address...   Sorting by object size...   Display all entries per group...

--- Summary for node address: 10.128.115.12 ---
Mem Used by Objects   Local References   Pinned       Pending Tasks   Captured in Objects   Actor Handles
800000519.0 B         2, (263.0 B)       0, (0.0 B)   0, (0.0 B)      0, (0.0 B)            3, (800000253.0 B)

--- Object references for node address: 10.128.115.12 ---
IP Address      | PID    | Type   | Call Site                                   | Size          | Reference Type  | Object Ref
10.128.115.12   | 447254 | Driver |                                             | ?             | LOCAL_REFERENCE | 4ee449587774c1f0ffffffffffffffffffffffff0100000001000000
10.128.115.12   | 447254 | Driver | <unknown>                                   | ?             | ACTOR_HANDLE    | ffffffffffffffff5ba0f22968968612a61467950100000002000000
10.128.115.12   | 447254 | Driver |                                             | ?             | ACTOR_HANDLE    | ffffffffffffffff5ba0f22968968612a61467950100000001000000
10.128.115.12   | 447254 | Driver |                                             | 264.0 B       | LOCAL_REFERENCE | 69a6825d641b46135ba0f22968968612a61467950100000001000000
10.128.115.12   | 447289 | Worker | (deserialize task arg) memory.check_memory | 800000255.0 B | ACTOR_HANDLE    | ffffffffffffffff5ba0f22968968612a61467950100000002000000

--- Summary for node address: 10.128.91.47 ---
Mem Used by Objects   Local References   Pinned       Pending Tasks   Captured in Objects   Actor Handles
800000255.0 B         0, (0.0 B)         0, (0.0 B)   0, (0.0 B)      0, (0.0 B)            2, (800000254.0 B)

--- Object references for node address: 10.128.91.47 ---
IP Address     | PID     | Type   | Call Site | Size          | Reference Type | Object Ref
10.128.91.47   | 3591538 | Worker |           | ?             | ACTOR_HANDLE   | ffffffffffffffff5ba0f22968968612a61467950100000001000000
10.128.91.47   | 3591538 | Worker |           | 800000255.0 B | ACTOR_HANDLE   | ffffffffffffffff5ba0f22968968612a61467950100000002000000

To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 762 MiB, 1 objects, 0.2% full, 0.2% needed
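As the report itself notes, the empty Call Site column can be filled in by setting RAY_record_ref_creation_sites=1. A minimal sketch, assuming the variable has to be in the process environment before Ray starts (our assumption; it would likely need to be set on every node's Ray processes, not only in the driver):

import os
os.environ["RAY_record_ref_creation_sites"] = "1"  # assumption: must be set before ray.init()

import ray
ray.init(address="auto")  # ObjectRefs created from here on should carry call-site info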

Versions / Dependencies

Ray = 1.9.0

Reproduction script

import ray
import numpy as np
import os

total_num = 100000000          # 1e8 float64 values, ~800 MB
actor_name = "global_actor"
pg_name = "pg_name"

@ray.remote
class DataActor:
    def __init__(self):
        # Put a ~800 MB array into the object store once, at actor creation.
        self.data = ray.put(np.ones(total_num))

    def get_data(self):
        return self.data

@ray.remote
def check_memory(data):
    # Run the CLI from inside a task so the report is produced on
    # whichever node the task was scheduled to.
    os.system("ray memory")
    return 1

ray.init(address="auto")

# Define a placement group to spread tasks across the cluster's nodes.
num_nodes = len(ray.nodes())

bundles = [{"CPU": 1}] * num_nodes
placement_group = ray.util.placement_group(
    name=pg_name,
    strategy="SPREAD",
    bundles=bundles,
)
ray.get(placement_group.ready())

# Create the actor on the first node (bundle 0).
my_actor = DataActor.options(
    name=actor_name,
    placement_group=placement_group,
    placement_group_bundle_index=0,
).remote()

data = my_actor.get_data.remote()

# First check: run "ray memory" on the same node as the actor.
ref = check_memory.options(
    placement_group=placement_group,
    placement_group_bundle_index=0,
).remote(data)
result = ray.get(ref)

# Second check: run "ray memory" on the last node.
ref = check_memory.options(
    placement_group=placement_group,
    placement_group_bundle_index=num_nodes - 1,
).remote(data)
result = ray.get(ref)
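One caveat with the script above: it assumes both nodes have already registered by the time ray.nodes() is called; otherwise num_nodes comes up short and the SPREAD placement group never reaches the second node. A hypothetical guard (wait_for_nodes is our helper, not part of the original report) that could run right after ray.init:

import time

def wait_for_nodes(expected, timeout_s=60.0):
    # Hypothetical helper: block until the expected number of nodes report Alive.
    deadline = time.time() + timeout_s
    alive = []
    while time.time() < deadline:
        alive = [n for n in ray.nodes() if n.get("Alive", False)]
        if len(alive) >= expected:
            return len(alive)
        time.sleep(1.0)
    raise TimeoutError("only %d node(s) alive after %.0fs" % (len(alive), timeout_s))

Calling wait_for_nodes(2) before creating the placement group pins the repro to the intended two-node layout.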

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

xinchen384 commented on Jan 5, 2022:

@clarkzinzow No. In the raylet logs there is just one warning:

/opt/conda/envs/py37/lib/python3.7/site-packages/ray/dashboard/modules/reporter/reporter_agent.py:38: UserWarning: gpustat package is not installed. GPU monitoring is not available. To have full functionality of the dashboard please install "pip install ray[default]".
  warnings.warn("gpustat package is not installed. GPU monitoring is "

stale[bot] commented on Sep 24, 2022:

Hi again! This issue is being closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen it, or open a new issue, if you'd still like it to be addressed.

As always, you can ask for help on our discussion forum or Ray's public Slack channel.

Thanks again for opening the issue!
