Starting the cluster with memory_limit=None causes failures on the latest nightly
See original GitHub issue
Minimal Repro:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import dask

def test_func():
    return "abc"

if __name__ == "__main__":
    cluster = LocalCUDACluster(memory_limit=None)  # memory_limit=None triggers the failure
    client = Client(cluster)
    test_val = client.submit(test_func)
    print(test_val.result())
Trace:
Traceback (most recent call last):
File "test_bug.py", line 14, in <module>
print(test_val.result())
File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/distributed/client.py", line 220, in result
raise exc.with_traceback(tb)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 139, in __setitem__
self.host_buffer[key] = value
File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/zict/buffer.py", line 84, in __setitem__
if self.weight(key, value) <= self.n:
TypeError: '<=' not supported between instances of 'int' and 'NoneType'
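For context, the failure is reproducible without any GPU involvement: zict.Buffer compares each stored value's weight (an int) against its byte capacity n, and the host memory_limit is passed through as that capacity, so n=None breaks the comparison. A minimal sketch of the underlying behavior (the Buffer arguments here are illustrative, not the exact DeviceHostFile wiring):

from zict import Buffer

fast, slow = {}, {}
# n is the byte capacity of the fast mapping; memory_limit=None ends up here
buf = Buffer(fast, slow, n=None)

try:
    buf["key"] = b"abc"  # evaluates self.weight(key, value) <= self.n
except TypeError as err:
    print(err)  # '<=' not supported between instances of 'int' and 'NoneType'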
Env:
dask-cuda 0.14.0a200330 py37_35 rapidsai-nightly
Workaround:
Setting memory_limit to "auto" works:
cluster = LocalCUDACluster(memory_limit='auto')
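Until the nightly is fixed, application code can also guard against passing None through (a simple sketch; user_limit is a hypothetical variable standing in for a value coming from config or CLI defaults):

from dask_cuda import LocalCUDACluster

user_limit = None  # may arrive as None from upstream configuration
# Fall back to "auto", which distributed resolves from system memory
cluster = LocalCUDACluster(memory_limit=user_limit if user_limit is not None else "auto")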
Other details:
This used to work with an earlier nightly:
dask-cuda 0.13.0b200329 py37_86 rapidsai-nightly
CC: @ayushdg, who triaged this.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the clarification. Agreed that a discussion for #270 around device_limits would also be useful. To clarify my discussion w.r.t. auto, I am referring to defaults (auto) with host memory.

Glad it’s clear now!
Were you seeing this with dask-cuda-worker, or are you still seeing it? If you’re still seeing it, there may still be a bug.
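For reference on the auto default discussed above: distributed resolves "auto" into a concrete byte count based on total system memory and the worker's thread count. A quick way to inspect the resolved value (a sketch assuming distributed's parse_memory_limit helper, whose import path has moved between releases):

from distributed.worker import parse_memory_limit

# "auto" resolves to a share of total system memory proportional to nthreads
print(parse_memory_limit("auto", nthreads=1))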