
'ray memory' fails if there are many objects in scope

See original GitHub issue

What is the problem?

I was helping a user debug OOM errors and asked them to run `ray memory`. `ray memory` crashed with the following output:

2020-05-19 02:13:32,283	INFO scripts.py:976 -- Connecting to Ray instance at 172.31.6.12:34940.
2020-05-19 02:13:32,284	WARNING worker.py:809 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
(pid=5906) E0519 02:13:32.383447  5906 plasma_store_provider.cc:108] Failed to put object d47fe8ca624da001ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5906) Waiting 1000ms for space to free up...
(pid=5906) 2020-05-19 02:13:32,594	INFO (unknown file):0 -- gc.collect() freed 10 refs in 0.11551751299975876 seconds
(pid=5771) E0519 02:13:32.686894  5771 plasma_store_provider.cc:118] Failed to put object 72e67d09154b35b1ffffffff010000c801000000 after 6 attempts. Plasma store status:
(pid=5771) num clients with quota: 0
(pid=5771) quota map size: 0
(pid=5771) pinned quota map size: 0
(pid=5771) allocated bytes: 19130609999
(pid=5771) allocation limit: 19130641612
(pid=5771) pinned bytes: 19130609999
(pid=5771) (global lru) capacity: 19130641612
(pid=5771) (global lru) used: 0%
(pid=5771) (global lru) num objects: 0
(pid=5771) (global lru) num evictions: 0
(pid=5771) (global lru) bytes evicted: 0
(pid=5771) ---
(pid=5771) --- Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=5771) ---
(pid=5771) E0519 02:13:32.880080  5771 plasma_store_provider.cc:108] Failed to put object 1f5c36abed661dbeffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5771) Waiting 1000ms for space to free up...
(pid=5769) E0519 02:13:32.882894  5769 plasma_store_provider.cc:108] Failed to put object cb31822e7f0e3c70ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5769) Waiting 2000ms for space to free up...
(pid=5771) 2020-05-19 02:13:33,215	INFO (unknown file):0 -- gc.collect() freed 10 refs in 0.23763301200006026 seconds
(pid=5906) E0519 02:13:33.383901  5906 plasma_store_provider.cc:108] Failed to put object d47fe8ca624da001ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5906) Waiting 2000ms for space to free up...
Traceback (most recent call last):
  File "/home/ubuntu/src/seeweed/ml/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1028, in main
    return cli()
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/scripts/scripts.py", line 978, in memory
    print(ray.internal.internal_api.memory_summary())
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/internal/internal_api.py", line 28, in memory_summary
    node_manager_pb2.FormatGlobalMemoryInfoRequest(), timeout=30.0)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Received message larger than max (28892999 vs. 4194304)"
	debug_error_string = "{"created":"@1589854413.712252174","description":"Received message larger than max (28892999 vs. 4194304)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":188,"grpc_status":8}"
>
(pid=5771) E0519 02:13:33.880635  5771 plasma_store_provider.cc:108] Failed to put object 1f5c36abed661dbeffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5771) Waiting 2000ms for space to free up...
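
For context on the traceback above: `ray memory` fetches a memory summary from the node manager over gRPC (the FormatGlobalMemoryInfoRequest in the stack trace), and here the reply was 28892999 bytes, well above gRPC's default per-message cap of 4194304 bytes (4 MiB), so the RPC fails with RESOURCE_EXHAUSTED whenever many objects are in scope. As a rough sketch of the general mechanism (not necessarily the change made in Ray itself), a Python gRPC client can raise that cap through channel options; the address and limit below are illustrative placeholders:

import grpc

# gRPC Python caps inbound messages at 4 MiB (4194304 bytes) by default,
# which is exactly the limit shown in the RESOURCE_EXHAUSTED error above.
MAX_MESSAGE_LENGTH = 64 * 1024 * 1024  # 64 MiB; arbitrary illustrative value

channel = grpc.insecure_channel(
    "127.0.0.1:62137",  # hypothetical node manager address and port
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_LENGTH),
        ("grpc.max_receive_message_length", MAX_MESSAGE_LENGTH),
    ],
)
# A stub created on this channel can then receive replies larger than
# the 4 MiB default.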

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
austinmw commented, Aug 28, 2020

@pitoupitou Hi, are these gc.collect() messages normal behavior? I’m getting a lot of them, although my job is not erroring out.

0 reactions
rkooo567 commented, May 25, 2020

@ericl I will set this to P1 because it looks pretty important for anyone who uses big clusters. Let’s find the assignee in the next planning.

Read more comments on GitHub >

Top Results From Across the Web

Memory Management — Ray 0.8.4 documentation
See Debugging using 'ray memory' for information on how to identify what objects are in scope in your application. This exception is raised...
Read more >
Out of Memory with RAY Python Framework - Stack Overflow
There can be many possible problems. For my case, I found that ipython creates a reference to python objects when I use it...
Read more >
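
The Stack Overflow result above points at a common source of the underlying OOM: any live Python reference to an ObjectRef (IPython's output history is a frequent culprit) keeps the corresponding object pinned in the plasma store, which is exactly what `ray memory` is meant to surface. A minimal sketch of that behavior, assuming a local `ray.init()` with enough object store memory for the example:

import ray

ray.init()

# While this ObjectRef is alive, the ~100 MB object stays pinned in the
# object store and would appear in the `ray memory` listing.
big_ref = ray.put(bytearray(100 * 1024 * 1024))
data = ray.get(big_ref)

# Dropping the last references lets Ray's reference counting release the
# object so the store can reclaim the memory.
del big_ref, data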
Frequently Asked Questions — PyTorch 1.13 documentation
Frequently Asked Questions. My model reports “cuda runtime error(2): out of memory”. As the error message suggests, you have run out of memory...
Read more >
Memory Usage Optimizations for GPU rendering
There are several ways to monitor GPU Memory Usage and Utilization if needed: V-Ray GPU reports how much memory is used for ...
Read more >
Ray Tips and Tricks, Part 2 — ray.get() - Medium
When you call ray.get() , it blocks until the corresponding ... in Ray's local object store (the objects cached in memory with Plasma)....
Read more >
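
The Medium post's point about ray.get() is the flip side of the same mechanism: submitting a task returns an ObjectRef immediately, while ray.get() blocks until the result is available in the local object store. A small sketch, again assuming a local `ray.init()`:

import ray

ray.init()

@ray.remote
def produce():
    # Any sizeable result ends up in the plasma object store.
    return list(range(1_000_000))

ref = produce.remote()   # returns an ObjectRef immediately, without blocking
result = ray.get(ref)    # blocks until the task finishes and its result
                         # has landed in the object store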
