'ray memory' fails if there are many objects in scope
What is the problem?
While helping a user debug OOM errors, I asked them to run ray memory. ray memory crashed with the following output:
2020-05-19 02:13:32,283 INFO scripts.py:976 -- Connecting to Ray instance at 172.31.6.12:34940.
2020-05-19 02:13:32,284 WARNING worker.py:809 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
(pid=5906) E0519 02:13:32.383447 5906 plasma_store_provider.cc:108] Failed to put object d47fe8ca624da001ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5906) Waiting 1000ms for space to free up...
(pid=5906) 2020-05-19 02:13:32,594 INFO (unknown file):0 -- gc.collect() freed 10 refs in 0.11551751299975876 seconds
(pid=5771) E0519 02:13:32.686894 5771 plasma_store_provider.cc:118] Failed to put object 72e67d09154b35b1ffffffff010000c801000000 after 6 attempts. Plasma store status:
(pid=5771) num clients with quota: 0
(pid=5771) quota map size: 0
(pid=5771) pinned quota map size: 0
(pid=5771) allocated bytes: 19130609999
(pid=5771) allocation limit: 19130641612
(pid=5771) pinned bytes: 19130609999
(pid=5771) (global lru) capacity: 19130641612
(pid=5771) (global lru) used: 0%
(pid=5771) (global lru) num objects: 0
(pid=5771) (global lru) num evictions: 0
(pid=5771) (global lru) bytes evicted: 0
(pid=5771) ---
(pid=5771) --- Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=5771) ---
(pid=5771) E0519 02:13:32.880080 5771 plasma_store_provider.cc:108] Failed to put object 1f5c36abed661dbeffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5771) Waiting 1000ms for space to free up...
(pid=5769) E0519 02:13:32.882894 5769 plasma_store_provider.cc:108] Failed to put object cb31822e7f0e3c70ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5769) Waiting 2000ms for space to free up...
(pid=5771) 2020-05-19 02:13:33,215 INFO (unknown file):0 -- gc.collect() freed 10 refs in 0.23763301200006026 seconds
(pid=5906) E0519 02:13:33.383901 5906 plasma_store_provider.cc:108] Failed to put object d47fe8ca624da001ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5906) Waiting 2000ms for space to free up...
Traceback (most recent call last):
  File "/home/ubuntu/src/seeweed/ml/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1028, in main
    return cli()
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/scripts/scripts.py", line 978, in memory
    print(ray.internal.internal_api.memory_summary())
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/internal/internal_api.py", line 28, in memory_summary
    node_manager_pb2.FormatGlobalMemoryInfoRequest(), timeout=30.0)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.RESOURCE_EXHAUSTED
    details = "Received message larger than max (28892999 vs. 4194304)"
    debug_error_string = "{"created":"@1589854413.712252174","description":"Received message larger than max (28892999 vs. 4194304)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":188,"grpc_status":8}"
>
(pid=5771) E0519 02:13:33.880635 5771 plasma_store_provider.cc:108] Failed to put object 1f5c36abed661dbeffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5771) Waiting 2000ms for space to free up...
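The traceback pinpoints the actual failure: the memory summary reply from the raylet is 28892999 bytes, but gRPC rejects any inbound message larger than its default limit of 4194304 bytes (4 MiB), so ray memory dies before printing anything. A minimal sketch of that size check (the 64 MiB figure below is an arbitrary example of a raised limit, not an actual Ray or gRPC setting):

```python
# gRPC's default max inbound message size, matching the 4194304 in the error.
DEFAULT_GRPC_MAX_MESSAGE = 4 * 1024 * 1024  # 4194304 bytes

# Size of the FormatGlobalMemoryInfo reply reported in the traceback.
reply_size = 28892999


def fits(size, limit=DEFAULT_GRPC_MAX_MESSAGE):
    """Return True if a reply of `size` bytes would pass gRPC's size filter."""
    return size <= limit


print(fits(reply_size))                     # False: the reply is rejected
print(fits(reply_size, 64 * 1024 * 1024))   # True: a 64 MiB limit would accept it
```

This is why the command only breaks on clusters with many objects in scope: each object adds a row to the summary, and past roughly 4 MiB of rows the RPC itself fails. A fix would need the client channel to raise `grpc.max_receive_message_length`, or the reply to be paginated.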
Issue Analytics
- Created 3 years ago
- Comments: 5 (3 by maintainers)
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@pitoupitou Hi, are these gc.collect() messages normal behavior? I’m getting a lot of them, although my job is not erroring out.
@ericl I will set this to P1 because it looks pretty important for anyone who uses big clusters. Let’s find the assignee in the next planning.
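The plasma store status in the log above also explains why the cluster was wedged in the first place: every allocated byte is pinned, so the global LRU evictor has nothing it can reclaim, and the remaining headroom is smaller than the object being put. A quick check of the numbers from the log:

```python
# Figures copied from the plasma store status in the log above.
allocated = 19130609999   # allocated bytes
limit = 19130641612       # allocation limit
pinned = 19130609999      # pinned bytes
object_size = 196886      # size of the object that failed to be put

# Only unpinned allocations are candidates for LRU eviction.
evictable = allocated - pinned
headroom = limit - allocated

print(evictable)                  # 0 -> eviction cannot free anything
print(headroom)                   # 31613 bytes of free space
print(headroom < object_size)     # True -> the put can never succeed
```

So the "Waiting 1000ms for space to free up..." retries were doomed: with zero evictable bytes and only ~31 KB of headroom, a ~192 KB put fails on every attempt, which is exactly the situation ray memory is supposed to help diagnose.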