
GCP cluster connection timeout over external IP


Very excited to see https://github.com/dask/dask-cloudprovider/pull/131 in! I gave this a try again today, @quasiben, but ran into an issue. I ran this:

from dask_cloudprovider.gcp.instances import GCPCluster
cluster = GCPCluster(
    name='dask-gcp-test-1', 
    zone='us-east1-c', 
    machine_type='n1-standard-8', 
    projectid=MY_PROJECT_ID,
    docker_image="daskdev/dask:latest",
    ngpus=0
)
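
# Console output from the launch: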
Launching cluster with the following configuration: 
  Source Image: projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014 
  Docker Image: daskdev/dask:latest 
  Machine Type: n1-standard-8 
  Filesytsem Size: 50 
  N-GPU Type:  
  Zone: us-east1-c 
Creating scheduler instance
dask-7c88f69a-scheduler
	Internal IP: 10.142.0.2
	External IP: 34.75.186.200
Waiting for scheduler to run
# No connection made after ~10 mins so I killed the process

The scheduler connection never happened, despite this being run from a VM inside GCP. That would make sense if the client is trying to connect via the external IP: I found that if I try to connect directly that way (via dask.distributed.Client), it also doesn't work. I have no firewall rules set up to allow ingress on port 8786 and I'd rather not add them. Is there a way to have GCPCluster or its parent classes use internal IPs instead? I would much prefer that. If I connect using the internal IP (again via dask.distributed.Client), everything is fine, so I know the VM is up and running correctly.
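
For reference, a minimal sketch of the direct-connection check described above, assuming the internal address printed during launch (10.142.0.2) and Dask's default scheduler port of 8786:

from dask.distributed import Client

# Connect straight to the scheduler over the VPC-internal address.
client = Client("tcp://10.142.0.2:8786")

# Only needs the scheduler to be reachable, not any workers.
print(client.scheduler_info())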

How did you get around this in your testing? Did you configure firewall rule exceptions for the Dask scheduler port, or did you perhaps already have them in place?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17 (17 by maintainers)

Top GitHub Comments

1 reaction
jacobtomlinson commented, Nov 13, 2020

This is really useful feedback thanks!

You may find this blog post of interest: it discusses the state of Dask cluster deployments and was written earlier this year, particularly the section on ephemeral vs fixed clusters.

One large assumption in dask-cloudprovider is that it provides ephemeral clusters. Today we do not have native cloud fixed cluster options. For GCP, a fixed cluster setup might look like a Cloud Deployment Manager template.

Comparing that with Kubernetes, we have dask-kubernetes for ephemeral deployments and the Dask Helm chart for fixed deployments.

I am imagining that I can simply have hooks in my code that create and destroy a cluster for every script (or fairly large unit of work)

This is typically the intended use case for ephemeral Dask clusters.
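
For illustration, a rough sketch of that ephemeral pattern, using the cluster and client as context managers so everything is torn down when the script's work finishes. The GCPCluster arguments mirror the ones from the issue, and MY_PROJECT_ID is a placeholder:

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

# Create a cluster for this script only; it is destroyed again on exit.
with GCPCluster(
    name="dask-gcp-test-1",
    zone="us-east1-c",
    machine_type="n1-standard-8",
    projectid=MY_PROJECT_ID,
    docker_image="daskdev/dask:latest",
    ngpus=0,
) as cluster:
    # One client per cluster, as recommended below.
    with Client(cluster) as client:
        print(client.dashboard_link)
        # ... this script's Dask work goes here ...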

though that seems less ideal to me than having it be possible to provide a scheduler address to the script

I would be interested to know why you see this as less ideal. The main reasons I can think of are that cluster startup is too slow, or that your scripts are small, fast and numerous.

Is this a type of usage that has been requested before?

The type of usage you are referring to is a fixed cluster. There is demand for fixed clusters in the Dask ecosystem, but less so than for ephemeral clusters, typically because fixed clusters can be wasteful in terms of money or credits. I see more fixed clusters running on in-house hardware where the cost is fixed up front.

The most compelling argument I have for that use case is in pipeline steps/scripts that would create dask clients in a deployment-agnostic manner

One concern with this approach is that you may end up with multiple clients sharing a single cluster. This is not recommended. Dask does not differentiate between clients or have any concept of queueing or fair usage. There should always be a one-to-one relationship between clusters and clients.

0 reactions
jacobtomlinson commented, Nov 17, 2020

One last thought re: ephemeral clusters tied to a single process – debugging them is impossible when useful information is in the dask worker logs after a failure.

This is a good point. It would perhaps be useful to print cluster.get_logs() in the event of a failure before the process exits. But that would only work for certain failure modes.
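
A rough sketch of what that could look like in a user's script today (rather than as library behaviour), assuming cluster.get_logs() returns a mapping of component name to log text as in the distributed cluster interface; run_pipeline and MY_PROJECT_ID are placeholders:

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

cluster = GCPCluster(name="dask-gcp-test-1", zone="us-east1-c", projectid=MY_PROJECT_ID)
client = Client(cluster)
try:
    run_pipeline(client)  # placeholder for this script's Dask work
except Exception:
    # Dump scheduler and worker logs before the ephemeral cluster disappears.
    for name, log in cluster.get_logs().items():
        print(f"----- {name} -----\n{log}")
    raise
finally:
    client.close()
    cluster.close()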

