
GCP cluster connection timeout over external IP


Very excited to see https://github.com/dask/dask-cloudprovider/pull/131 in! I gave this a try again today, @quasiben, but ran into an issue. I ran this:

from dask_cloudprovider.gcp.instances import GCPCluster
cluster = GCPCluster(
    name='dask-gcp-test-1', 
    zone='us-east1-c', 
    machine_type='n1-standard-8', 
    projectid=MY_PROJECT_ID,
    docker_image="daskdev/dask:latest",
    ngpus=0
)
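
# Console output from the launch: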
Launching cluster with the following configuration: 
  Source Image: projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014 
  Docker Image: daskdev/dask:latest 
  Machine Type: n1-standard-8 
  Filesytsem Size: 50 
  N-GPU Type:  
  Zone: us-east1-c 
Creating scheduler instance
dask-7c88f69a-scheduler
	Internal IP: 10.142.0.2
	External IP: 34.75.186.200
Waiting for scheduler to run
# No connection made after ~10 mins so I killed the process

The scheduler connection never happened, despite this being run from a VM inside GCP. That would make sense if the client is trying to connect via the external IP: I found that if I try to connect directly that way (via dask.distributed.Client), it also doesn't work. I have no firewall rules set up to allow ingress on port 8786 and I'd rather not add them. Is there a way to have GCPCluster or its parent classes use internal IPs instead? I would much prefer that. If I connect using the internal IP (again via dask.distributed.Client), everything is fine, so I know the VM is up and running correctly.
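
For reference, a minimal sketch of the direct-connection check described above, assuming the internal address printed during launch (10.142.0.2) and Dask's default scheduler port of 8786:

from dask.distributed import Client

# Connect straight to the scheduler over the VPC-internal address.
client = Client("tcp://10.142.0.2:8786")

# Only needs the scheduler to be reachable, not any workers.
print(client.scheduler_info())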

How did you get around this in your testing? Did you configure firewall rule exceptions for the Dask scheduler port, or did you perhaps already have them in place?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 17 (17 by maintainers)

Top GitHub Comments

1 reaction
jacobtomlinson commented, Nov 13, 2020

This is really useful feedback thanks!

You may find this blog post of interest: it discusses the state of Dask cluster deployments and was written earlier this year, particularly the section on ephemeral vs fixed clusters.

One large assumption in dask-cloudprovider is that it provides ephemeral clusters. Today we do not have native cloud fixed cluster options. For GCP, a fixed cluster setup might look like a Cloud Deployment Manager template.

Comparing that with Kubernetes, we have dask-kubernetes for ephemeral deployments and the Dask Helm chart for fixed deployments.

I am imagining that I can simply have hooks in my code that create and destroy a cluster for every script (or fairly large unit of work)

This is typically the intended use case for ephemeral Dask clusters.
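
For illustration, a rough sketch of that ephemeral pattern, using the cluster and client as context managers so everything is torn down when the script's work finishes. The GCPCluster arguments mirror the ones from the issue, and MY_PROJECT_ID is a placeholder:

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

# Create a cluster for this script only; it is destroyed again on exit.
with GCPCluster(
    name="dask-gcp-test-1",
    zone="us-east1-c",
    machine_type="n1-standard-8",
    projectid=MY_PROJECT_ID,
    docker_image="daskdev/dask:latest",
    ngpus=0,
) as cluster:
    # One client per cluster, as recommended below.
    with Client(cluster) as client:
        print(client.dashboard_link)
        # ... this script's Dask work goes here ...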

though that seems less ideal to me than having it be possible to provide a scheduler address to the script

I would be interested to know why you see this as less ideal. The main reasons I can think of are that cluster startup is too slow, or that your scripts are small, fast and numerous.

Is this a type of usage that has been requested before?

The type of usage you are referring to is a fixed cluster. There is demand for fixed clusters in the Dask ecosystem, but less so than for ephemeral clusters, typically because fixed clusters can be wasteful in terms of money or credits. I see more fixed clusters running on in-house hardware where the cost is fixed up front.

The most compelling argument I have for that use case is in pipeline steps/scripts that would create dask clients in a deployment-agnostic manner

One concern with this approach is that you may end up with multiple clients sharing a single cluster. This is not recommended. Dask does not differentiate between clients or have any concept of queueing or fair usage. There should always be a one-to-one relationship between clusters and clients.

0 reactions
jacobtomlinson commented, Nov 17, 2020

One last thought re: ephemeral clusters tied to a single process – debugging them is impossible when useful information is in the dask worker logs after a failure.

This is a good point. It would perhaps be useful to print cluster.get_logs() in the event of a failure before the process exits. But that would only work for certain failure modes.
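
A rough sketch of what that could look like in a user's script today (rather than as library behaviour), assuming cluster.get_logs() returns a mapping of component name to log text as in the distributed cluster interface; run_pipeline and MY_PROJECT_ID are placeholders:

from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

cluster = GCPCluster(name="dask-gcp-test-1", zone="us-east1-c", projectid=MY_PROJECT_ID)
client = Client(cluster)
try:
    run_pipeline(client)  # placeholder for this script's Dask work
except Exception:
    # Dump scheduler and worker logs before the ephemeral cluster disappears.
    for name, log in cluster.get_logs().items():
        print(f"----- {name} -----\n{log}")
    raise
finally:
    client.close()
    cluster.close()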

