Starting the cluster with memory_limit=None causes failures on the latest nightly

Minimal Repro:

from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import dask 
 
def test_func():
    return "abc"
        

if __name__ == "__main__":
    cluster = LocalCUDACluster(memory_limit=None)
    client = Client(cluster)
    
    test_val = client.submit(test_func)
    print(test_val.result())

Trace:

Traceback (most recent call last):
  File "test_bug.py", line 14, in <module>
    print(test_val.result())
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/distributed/client.py", line 220, in result
    raise exc.with_traceback(tb)
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 139, in __setitem__
    self.host_buffer[key] = value
  File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/zict/buffer.py", line 84, in __setitem__
    if self.weight(key, value) <= self.n:
TypeError: '<=' not supported between instances of 'int' and 'NoneType'
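
The final frame of the trace points at the cause: the host buffer's capacity `n` is `None`, and Python 3 does not define an ordering between `int` and `None`. The sketch below reproduces only that comparison. `TinyBuffer` and `try_store` are hypothetical names invented for illustration, not zict's actual implementation, and the example needs neither a GPU nor dask-cuda.

```python
class TinyBuffer:
    """Hypothetical, heavily simplified stand-in for a size-limited store."""

    def __init__(self, n):
        self.n = n       # capacity; passing None reproduces the failure
        self.fast = {}

    def weight(self, key, value):
        # Illustrative weight function: the length of the value.
        return len(value)

    def __setitem__(self, key, value):
        # Same kind of comparison as the last frame of the traceback.
        if self.weight(key, value) <= self.n:
            self.fast[key] = value
        # A real buffer would spill to a slower store here; omitted.


def try_store(limit):
    """Store one small value and report what happened."""
    buf = TinyBuffer(limit)
    try:
        buf["x"] = "abc"
    except TypeError as exc:
        return str(exc)
    return "stored"


print(try_store(10))
print(try_store(None))
```

With an integer capacity the value is stored; with `None` the comparison raises the same kind of `TypeError` as in the report.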

Env:

dask-cuda                 0.14.0a200330           py37_35    rapidsai-nightly

Workaround:

Setting memory_limit to 'auto' works:

    cluster = LocalCUDACluster(memory_limit='auto')
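
If `None` reaches the cluster constructor from configuration or user input, one way to apply this workaround is to substitute `'auto'` before the call. This is a minimal sketch: `normalize_memory_limit` is a hypothetical helper, not part of dask-cuda. It also changes behavior, because `'auto'` asks for a computed host-memory limit instead of passing `None` through.

```python
def normalize_memory_limit(memory_limit):
    """Hypothetical helper: replace None with 'auto', pass anything else through."""
    return "auto" if memory_limit is None else memory_limit


# Intended use (requires dask-cuda and a GPU, so it is left commented out):
# from dask_cuda import LocalCUDACluster
# cluster = LocalCUDACluster(memory_limit=normalize_memory_limit(None))

print(normalize_memory_limit(None))
print(normalize_memory_limit("4GB"))
```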

Other details:

This used to work with the earlier nightly:

dask-cuda                 0.13.0b200329           py37_86    rapidsai-nightly

CC: @ayushdg, who triaged this.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17 (17 by maintainers)

Top GitHub Comments

ayushdg commented, Mar 31, 2020 (1 reaction)

Just to be clear, memory_limit refers to host memory. There is a separate device_memory_limit for device memory, which we have discussed extending the same functionality to (#270).

Thanks for the clarification. Agreed that a discussion for #270 around device_limits would also be useful. To clarify my discussion w.r.t. auto, I am referring to defaults (auto) with host memory.
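
To illustrate the distinction drawn here, the sketch below keeps the two limits as separate keyword arguments: `memory_limit` for host memory and `device_memory_limit` for device memory. `build_cluster_kwargs` is a hypothetical helper and the values are illustrative only. Which values each parameter accepts depends on the dask-cuda version, so treat this as an assumption to check against the release in use.

```python
def build_cluster_kwargs(host_limit="auto", device_limit=None):
    """Hypothetical helper: collect the host and device limits separately."""
    kwargs = {"memory_limit": host_limit}   # host (CPU) memory limit
    if device_limit is not None:
        # Device (GPU) memory limit; left out to keep the library default.
        kwargs["device_memory_limit"] = device_limit
    return kwargs


# Intended use (requires dask-cuda and a GPU, so it is left commented out):
# from dask_cuda import LocalCUDACluster
# cluster = LocalCUDACluster(**build_cluster_kwargs(device_limit="10GB"))

print(build_cluster_kwargs())
print(build_cluster_kwargs(device_limit="10GB"))
```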

pentschev commented, Apr 1, 2020 (0 reactions)

Thanks for the clarification. You are right about the memory and we assign the correct values as expected.

Glad it’s clear now!

I was seeing this behavior with dask-cuda-worker and assumed the same would be happening with LocalCUDACluster as well.

You were seeing this with dask-cuda-worker, or are you still seeing it? If you’re still seeing that, there may still be a bug.
