
'ray memory' fails if there are many objects in scope

See original GitHub issue

What is the problem?

I was helping a user debug OOM errors and asked them to run `ray memory`. `ray memory` crashed with the following output:

2020-05-19 02:13:32,283	INFO scripts.py:976 -- Connecting to Ray instance at 172.31.6.12:34940.
2020-05-19 02:13:32,284	WARNING worker.py:809 -- When connecting to an existing cluster, _internal_config must match the cluster's _internal_config.
(pid=5906) E0519 02:13:32.383447  5906 plasma_store_provider.cc:108] Failed to put object d47fe8ca624da001ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5906) Waiting 1000ms for space to free up...
(pid=5906) 2020-05-19 02:13:32,594	INFO (unknown file):0 -- gc.collect() freed 10 refs in 0.11551751299975876 seconds
(pid=5771) E0519 02:13:32.686894  5771 plasma_store_provider.cc:118] Failed to put object 72e67d09154b35b1ffffffff010000c801000000 after 6 attempts. Plasma store status:
(pid=5771) num clients with quota: 0
(pid=5771) quota map size: 0
(pid=5771) pinned quota map size: 0
(pid=5771) allocated bytes: 19130609999
(pid=5771) allocation limit: 19130641612
(pid=5771) pinned bytes: 19130609999
(pid=5771) (global lru) capacity: 19130641612
(pid=5771) (global lru) used: 0%
(pid=5771) (global lru) num objects: 0
(pid=5771) (global lru) num evictions: 0
(pid=5771) (global lru) bytes evicted: 0
(pid=5771) ---
(pid=5771) --- Tip: Use the `ray memory` command to list active objects in the cluster.
(pid=5771) ---
(pid=5771) E0519 02:13:32.880080  5771 plasma_store_provider.cc:108] Failed to put object 1f5c36abed661dbeffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5771) Waiting 1000ms for space to free up...
(pid=5769) E0519 02:13:32.882894  5769 plasma_store_provider.cc:108] Failed to put object cb31822e7f0e3c70ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5769) Waiting 2000ms for space to free up...
(pid=5771) 2020-05-19 02:13:33,215	INFO (unknown file):0 -- gc.collect() freed 10 refs in 0.23763301200006026 seconds
(pid=5906) E0519 02:13:33.383901  5906 plasma_store_provider.cc:108] Failed to put object d47fe8ca624da001ffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5906) Waiting 2000ms for space to free up...
Traceback (most recent call last):
  File "/home/ubuntu/src/seeweed/ml/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/scripts/scripts.py", line 1028, in main
    return cli()
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/scripts/scripts.py", line 978, in memory
    print(ray.internal.internal_api.memory_summary())
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/ray/internal/internal_api.py", line 28, in memory_summary
    node_manager_pb2.FormatGlobalMemoryInfoRequest(), timeout=30.0)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/grpc/_channel.py", line 826, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/home/ubuntu/src/seeweed/ml/lib/python3.7/site-packages/grpc/_channel.py", line 729, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.RESOURCE_EXHAUSTED
	details = "Received message larger than max (28892999 vs. 4194304)"
	debug_error_string = "{"created":"@1589854413.712252174","description":"Received message larger than max (28892999 vs. 4194304)","file":"src/core/ext/filters/message_size/message_size_filter.cc","file_line":188,"grpc_status":8}"
>
(pid=5771) E0519 02:13:33.880635  5771 plasma_store_provider.cc:108] Failed to put object 1f5c36abed661dbeffffffff010000c801000000 in object store because it is full. Object size is 196886 bytes.
(pid=5771) Waiting 2000ms for space to free up...
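
For context on the traceback above: `ray memory` fetches a memory summary from the node manager over gRPC (the FormatGlobalMemoryInfoRequest in the stack trace), and here the reply was 28892999 bytes, well above gRPC's default per-message cap of 4194304 bytes (4 MiB), so the RPC fails with RESOURCE_EXHAUSTED whenever many objects are in scope. As a rough sketch of the general mechanism (not necessarily the change made in Ray itself), a Python gRPC client can raise that cap through channel options; the address and limit below are illustrative placeholders:

import grpc

# gRPC Python caps inbound messages at 4 MiB (4194304 bytes) by default,
# which is exactly the limit shown in the RESOURCE_EXHAUSTED error above.
MAX_MESSAGE_LENGTH = 64 * 1024 * 1024  # 64 MiB; arbitrary illustrative value

channel = grpc.insecure_channel(
    "127.0.0.1:62137",  # hypothetical node manager address and port
    options=[
        ("grpc.max_send_message_length", MAX_MESSAGE_LENGTH),
        ("grpc.max_receive_message_length", MAX_MESSAGE_LENGTH),
    ],
)
# A stub created on this channel can then receive replies larger than
# the 4 MiB default.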

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
austinmw commented, Aug 28, 2020

@pitoupitou Hi, are these gc.collect() messages normal behavior? I’m getting a lot of them, although my job is not erroring out.

0 reactions
rkooo567 commented, May 25, 2020

@ericl I will set this to P1 because it looks pretty important for anyone who uses big clusters. Let’s find the assignee in the next planning.

Read more comments on GitHub >

Top Results From Across the Web

Memory Management — Ray 0.8.4 documentation
See Debugging using 'ray memory' for information on how to identify what objects are in scope in your application. This exception is raised...
Read more >
Out of Memory with RAY Python Framework - Stack Overflow
There can be many possible problems. For my case, I found that ipython creates a reference to python objects when I use it...
Read more >
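
The Stack Overflow result above points at a common source of the underlying OOM: any live Python reference to an ObjectRef (IPython's output history is a frequent culprit) keeps the corresponding object pinned in the plasma store, which is exactly what `ray memory` is meant to surface. A minimal sketch of that behavior, assuming a local `ray.init()` with enough object store memory for the example:

import ray

ray.init()

# While this ObjectRef is alive, the ~100 MB object stays pinned in the
# object store and would appear in the `ray memory` listing.
big_ref = ray.put(bytearray(100 * 1024 * 1024))
data = ray.get(big_ref)

# Dropping the last references lets Ray's reference counting release the
# object so the store can reclaim the memory.
del big_ref, data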
Frequently Asked Questions — PyTorch 1.13 documentation
Frequently Asked Questions. My model reports “cuda runtime error(2): out of memory”. As the error message suggests, you have run out of memory...
Read more >
Memory Usage Optimizations for GPU rendering
There are several ways to monitor GPU Memory Usage and Utilization if needed: V-Ray GPU reports how much memory is used for ...
Read more >
Ray Tips and Tricks, Part 2 — ray.get() - Medium
When you call ray.get() , it blocks until the corresponding ... in Ray's local object store (the objects cached in memory with Plasma)....
Read more >
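
The Medium post's point about ray.get() is the flip side of the same mechanism: submitting a task returns an ObjectRef immediately, while ray.get() blocks until the result is available in the local object store. A small sketch, again assuming a local `ray.init()`:

import ray

ray.init()

@ray.remote
def produce():
    # Any sizeable result ends up in the plasma object store.
    return list(range(1_000_000))

ref = produce.remote()   # returns an ObjectRef immediately, without blocking
result = ray.get(ref)    # blocks until the task finishes and its result
                         # has landed in the object store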
