GCP cluster connection timeout over external IP
Very excited to see https://github.com/dask/dask-cloudprovider/pull/131 in! I gave this a try again today @quasiben but had an issue. I ran this:
```python
from dask_cloudprovider.gcp.instances import GCPCluster

cluster = GCPCluster(
    name='dask-gcp-test-1',
    zone='us-east1-c',
    machine_type='n1-standard-8',
    projectid=MY_PROJECT_ID,
    docker_image="daskdev/dask:latest",
    ngpus=0,
)
```
```
Launching cluster with the following configuration:
Source Image: projects/ubuntu-os-cloud/global/images/ubuntu-minimal-1804-bionic-v20201014
Docker Image: daskdev/dask:latest
Machine Type: n1-standard-8
Filesytsem Size: 50
N-GPU Type:
Zone: us-east1-c
Creating scheduler instance
dask-7c88f69a-scheduler
Internal IP: 10.142.0.2
External IP: 34.75.186.200
Waiting for scheduler to run
# No connection made after ~10 mins so I killed the process
```
The scheduler connection never occurred, despite this being run from a VM in GCP. That would make sense if it's trying to connect via the external IP: if I try to connect that way directly (via dask.distributed.Client), that doesn't work either. I have no firewall rules set up to allow ingress to 8786, and I'd rather not add them. Is there a way to have GCPCluster or its parent classes use internal IPs instead? I would much prefer that. If I connect using the internal IP (again via dask.distributed.Client), everything is fine, so I know the VM is up and running correctly.
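For reference, this is roughly how I'm connecting directly (a minimal sketch; the address comes from the launch log above):

```python
from dask.distributed import Client

# Connecting via the internal IP from another VM in the same VPC works,
# so the scheduler itself is up; only the external route is blocked.
client = Client("tcp://10.142.0.2:8786")
print(client.scheduler_info())
```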
How did you get around this in your testing? Did you configure firewall rule exceptions for the Dask scheduler port, or perhaps already have them in place?
Top GitHub Comments
This is really useful feedback, thanks!
You may find this blog post of interest: it discusses the state of Dask cluster deployments and was written earlier this year. The section on ephemeral vs fixed clusters is particularly relevant.
One large assumption in dask-cloudprovider is that it provides ephemeral clusters. Today we do not have native fixed-cluster options for the cloud. For GCP, a fixed cluster setup might look like a Cloud Deployment Manager template.
Comparing that with Kubernetes: we have dask-kubernetes for ephemeral deployments and the Dask Helm chart for fixed deployments.
This is typically the intended use case for ephemeral Dask clusters.
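For illustration, the ephemeral pattern looks roughly like this (a minimal sketch reusing the arguments from the issue body; `MY_PROJECT_ID` is a placeholder there too):

```python
from dask.distributed import Client
from dask_cloudprovider.gcp.instances import GCPCluster

# Ephemeral pattern: the cluster exists only for the lifetime of this
# block and is torn down automatically on exit.
with GCPCluster(
    name='dask-gcp-test-1',
    zone='us-east1-c',
    machine_type='n1-standard-8',
    projectid=MY_PROJECT_ID,  # placeholder, as in the issue body
    ngpus=0,
) as cluster:
    with Client(cluster) as client:
        # One small computation, then everything is cleaned up.
        print(client.submit(sum, range(100)).result())
```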
I would be interested to know why you see this as less ideal. The primary reasons I can think of are that cluster startup is too slow, and that your scripts are small, fast, and numerous.
The type of usage you are referring to is a fixed cluster. There is demand for fixed clusters in the Dask ecosystem, but less than for ephemeral clusters, typically because fixed clusters can be wasteful in terms of money or credits. I see more fixed clusters running on in-house hardware, where the cost is fixed up front.
One concern with this approach is that you may end up with multiple clients sharing a single cluster. This is not recommended: Dask does not differentiate between clients or have any concept of queueing or fair usage. There should always be a one-to-one relationship between clusters and clients.
This is a good point. It would perhaps be useful to print `cluster.get_logs()` in the event of a failure before the process exits, although that would only work for certain failure modes.
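A rough sketch of what that could look like on the user side today (a hypothetical error-handling pattern, not dask-cloudprovider behaviour; the timeout value and the broad exception catch are assumptions for illustration):

```python
from dask.distributed import Client

# Hypothetical pattern: if the client cannot reach the scheduler in time,
# dump whatever logs the cloud instances produced before bailing out.
try:
    client = Client(cluster, timeout="60s")  # assumed timeout value
except Exception:
    # Broad catch for illustration only. Cluster.get_logs() returns a
    # mapping of instance name -> log text.
    for name, log in cluster.get_logs().items():
        print(f"--- {name} ---")
        print(log)
    raise
```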