Starting the cluster with memory_limit=None causes failures on the latest nightly
See original GitHub issue
Minimal Repro:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait
import dask

def test_func():
    return "abc"

if __name__ == "__main__":
    cluster = LocalCUDACluster(memory_limit=None)  # memory_limit=None triggers the failure
    client = Client(cluster)
    test_val = client.submit(test_func)
    print(test_val.result())
Trace:
Traceback (most recent call last):
File "test_bug.py", line 14, in <module>
print(test_val.result())
File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/distributed/client.py", line 220, in result
raise exc.with_traceback(tb)
File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/dask_cuda/device_host_file.py", line 139, in __setitem__
self.host_buffer[key] = value
File "/raid/vjawa/conda_install/conda_env/envs/cudf_march_30/lib/python3.7/site-packages/zict/buffer.py", line 84, in __setitem__
if self.weight(key, value) <= self.n:
TypeError: '<=' not supported between instances of 'int' and 'NoneType'
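For context, the failure is reproducible without any GPU involvement: zict.Buffer compares each stored value's weight (an int) against its byte capacity n, and the host memory_limit is passed through as that capacity, so n=None breaks the comparison. A minimal sketch of the underlying behavior (the Buffer arguments here are illustrative, not the exact DeviceHostFile wiring):

from zict import Buffer

fast, slow = {}, {}
# n is the byte capacity of the fast mapping; memory_limit=None ends up here
buf = Buffer(fast, slow, n=None)

try:
    buf["key"] = b"abc"  # evaluates self.weight(key, value) <= self.n
except TypeError as err:
    print(err)  # '<=' not supported between instances of 'int' and 'NoneType'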
Env:
dask-cuda 0.14.0a200330 py37_35 rapidsai-nightly
Workaround:
Setting memory_limit to "auto" works:
cluster = LocalCUDACluster(memory_limit='auto')
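Until the nightly is fixed, application code can also guard against passing None through (a simple sketch; user_limit is a hypothetical variable standing in for a value coming from config or CLI defaults):

from dask_cuda import LocalCUDACluster

user_limit = None  # may arrive as None from upstream configuration
# Fall back to "auto", which distributed resolves from system memory
cluster = LocalCUDACluster(memory_limit=user_limit if user_limit is not None else "auto")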
Other details:
This used to work with an earlier nightly:
dask-cuda 0.13.0b200329 py37_86 rapidsai-nightly
CC: @ayushdg, who triaged this.
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Thanks for the clarification. Agreed that a discussion for #270 around device_limits would also be useful. To clarify my discussion w.r.t. auto, I am referring to defaults (auto) with host memory.

Glad it’s clear now!
Were you seeing this with dask-cuda-worker, or are you still seeing it? If you’re still seeing it, there may still be a bug.
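For reference on the auto default discussed above: distributed resolves "auto" into a concrete byte count based on total system memory and the worker's thread count. A quick way to inspect the resolved value (a sketch assuming distributed's parse_memory_limit helper, whose import path has moved between releases):

from distributed.worker import parse_memory_limit

# "auto" resolves to a share of total system memory proportional to nthreads
print(parse_memory_limit("auto", nthreads=1))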