
Best practices for remotely connecting to dask worker cluster


Our objective: have a spot worker cluster, defaulting to 0 nodes, running on AWS EKS. We hope to be able to connect to and use this cluster from a variety of places, including:

  • containers that are part of AWS Batch jobs
  • Kubernetes pods from other Kubernetes clusters
  • laptops
  • etc.

After looking through the documentation, I imagine that, from the perspective of the environment that wishes to use the dask cluster, the workflow would be something like:

    export DASK_KUBERNETES__HOST="https://__aws_identifier__.us-west-2.eks.amazonaws.com"
    export DASK_KUBERNETES__NAME="dask-worker-cluster"

(my understanding is that environment variables override the values normally set here)

and then within the script/notebook:

    cluster = KubeCluster.from_yaml('worker-values.yaml')

  • Spawn a local dask scheduler

    cluster.scale_up(50)

  • Call the EKS (in this example) dask cluster control plane and ask for 50 spot instances to be spun up
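
Pulled together, the intended workflow is roughly the following sketch (a sketch only, not a verified working setup; the endpoint, cluster name, and worker-values.yaml are the values used above):

    import dask
    from dask.distributed import Client
    from dask_kubernetes import KubeCluster

    # Equivalent in effect to the DASK_KUBERNETES__* exports above: dask maps
    # nested config keys to DASK_-prefixed, double-underscored env variables.
    dask.config.set({
        "kubernetes.host": "https://__aws_identifier__.us-west-2.eks.amazonaws.com",
        "kubernetes.name": "dask-worker-cluster",
    })

    # Starts a Dask scheduler in this local process and loads the worker pod
    # spec from the yaml file used above.
    cluster = KubeCluster.from_yaml("worker-values.yaml")

    # Ask Kubernetes for 50 worker pods; the cluster autoscaler / spot fleet is
    # expected to add EC2 nodes to satisfy them.
    cluster.scale_up(50)

    # Attach a client so computations run on those workers.
    client = Client(cluster)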

When I run the above commands from a jupyter notebook on my laptop, no errors are thrown, but no additional EC2 machines are spun up.

I should note that, using eksctl, I have successfully scaled nodes on this cluster up and down. Also, I confirmed that kubectl cluster-info on my laptop returns the correct API endpoint listed for this EKS cluster in the AWS console.

The good folks at Pangeo explained to me that in their default deployments, a dask scheduler is attached to the same (on-demand) pod that runs jupyter. In such a configuration, multiple dask schedulers can request resources from spot/pre-emptible dask workers.

My thinking is that what I described above roughly follows the same design pattern as Pangeo. However, if I’m correctly interpreting the discussion here, and more specifically @mrocklin’s comment here, maybe a dedicated scheduler (or schedulers) would need to be running within the k8s cluster?
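
To make that distinction concrete: with a dedicated scheduler running inside the k8s cluster, a client elsewhere only needs network access to that one endpoint rather than to every worker. A minimal sketch, with a placeholder address (not something from this issue):

    from dask.distributed import Client

    # Placeholder address: however the in-cluster scheduler is exposed
    # (Service, LoadBalancer, Ingress), only this endpoint needs to be
    # reachable from the client; the workers connect to the scheduler
    # inside the cluster.
    client = Client("tcp://dask-scheduler.example.internal:8786")
    print(client.scheduler_info())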

However, based on this statement:

    Since the Dask scheduler is launched locally, for it to work, we need to
    be able to open network connections between this local node and all the
    workers nodes on the Kubernetes cluster. If the current process is not
    already on a Kubernetes node, some network configuration will likely be
    required to make this work.

I’m hoping that my use case of a local scheduler + remote k8s workers (plus said network configuration) should be possible.
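
For what those network configs would need to allow: in the default (local scheduler) mode, the worker pods dial back to the scheduler running in this process. A quick way to see the address they would have to reach (a sketch, assuming the cluster object created above):

    # The worker pods created in EKS will try to open TCP connections back to
    # this address, so it must be routable from inside the cluster (VPN, VPC
    # peering, or a public IP restricted by security groups) and not blocked
    # by a local firewall.
    print(cluster.scheduler_address)   # e.g. tcp://192.168.1.23:37465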

My questions:

  • What network configuration is required in order for a local dask scheduler to connect to a remote worker cluster?
  • If this is possible, is there a page I missed that might include some best practices about how to configure such a setup? If not, I’d be happy to submit something to the docs.
  • In terms of how Dask works, is the design pattern I’ve described the best way to support my use cases?

Thanks in advance for suggestions or advice you might be able to offer. 😃

Additional info:

  • dask = 1.1.1
  • dask-kubernetes = 0.9
  • kubectl client version = v1.15.0
  • OS = Pop!_OS (Ubuntu) 18.04 LTS

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
jcrist commented, Jul 2, 2019

Correct, although there are likely other ways to solve this. Cloud providers like AWS provide ways to restrict in/out traffic to certain IPs, so I think there should be a way to expose the workers without opening them to the whole internet. With #84 (scheduler on the cluster) this would be easier, as you’d only need to expose the scheduler. I’m not familiar enough with AWS or kubernetes to know exactly what network configuration would be needed for either option (@jacobtomlinson may know).

In the long term dask-gateway intends to address this use case, where an admin sets up a cluster that users can connect to and create clusters. In this case you’d have a small node running the gateway instance, and spot instances for running user clusters as they are requested. If you’re interested, we hope to have a helm chart for trying this out in the near-ish future. Docs live here: https://jcrist.github.io/dask-gateway/
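
For reference, the client-side shape of that dask-gateway workflow looks roughly like the following sketch, based on dask-gateway's documented API; the gateway URL is a placeholder for whatever address an admin deploys it at:

    from dask_gateway import Gateway

    # Placeholder URL for a gateway running in front of the k8s cluster.
    gateway = Gateway("http://gateway.example.com")

    # The gateway launches the scheduler and workers inside the cluster on the
    # user's behalf, so the client only ever needs to reach the gateway.
    cluster = gateway.new_cluster()
    cluster.scale(50)
    client = cluster.get_client()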

0 reactions
jacobtomlinson commented, Oct 14, 2019

Now that we have a workable implementation of #84, and dask-gateway is getting more mature, I’m going to close this with a couple of different routes you could explore.
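
Those routes presumably correspond to the two options discussed above: running the scheduler inside the cluster (the #84 work) or putting dask-gateway in front of it (sketched earlier). For the first route, later releases of the classic dask-kubernetes KubeCluster expose a deploy_mode option; the following is a hypothetical sketch assuming a version that supports it (check the docs for your installed version):

    from dask_kubernetes import KubeCluster

    # deploy_mode="remote" runs the scheduler as a pod inside the Kubernetes
    # cluster rather than in the local process, so only the scheduler's
    # service has to be exposed to clients (assumption: your dask-kubernetes
    # version accepts this argument).
    cluster = KubeCluster.from_yaml("worker-values.yaml", deploy_mode="remote")
    cluster.scale(50)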
