Decouple "Dask" and Environments

Currently, Prefect Cloud/Server flow runs are configured by three concepts:

Environment
Executor
Storage

while local flow runs are configured with an Executor alone.

Subclasses of both Environment and Executor exist with “dask” in their name, which can be confusing, especially since usage differs between local and cloud/server runs.

For example, if you’re trying to run a flow on kubernetes using Dask:

If you’re running locally, you’d use dask-kubernetes to spin up a dask cluster, then configure a DaskExecutor to use that cluster, then pass that executor to flow.run.
If you’re running using cloud, you’d configure a DaskKubernetesEnvironment and attach it to your flow. Environments also have executors, which makes this confusing, since you have a Dask* thing inside a Dask* thing.

If you then decide you don't want to use a dask executor but still want to run on cloud, you need to change your environment to a KubernetesJobEnvironment. But you can still use a dask executor inside the KubernetesJobEnvironment? I find the current mixing of concepts a bit confusing; it feels like environments are encroaching on executors’ space.

I think a potentially cleaner way of specifying flow execution would be to remove anything “dask”-related from environments entirely and move it to executors alone. The separation of duties would be:

Environment:

Configures where and how to run the process that starts up a FlowRunner. This may be a kubernetes job, a docker container, a local process, etc…

Executor:

Configures where and how to run tasks inside a single flow run. An executor could run things without dask, with dask locally, connect to a remote dask cluster, or start a new temporary dask cluster just for this flow run. This is basically how things work now (minus the cluster creation).

One nice thing about this is that the only difference between running a flow locally and running a flow on cloud is the addition of the Environment. If you want to use a dask cluster on kubernetes to run your flow, in both cases you configure an Executor to spin up a dask cluster using dask-kubernetes. The Environment is only responsible for managing the initial process, not the dask processes.
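
As a rough sketch of this split, the snippet below uses small stand-in classes to show the same executor being reused for a local run and for a run launched through an environment. Every class name here (`SynchronousExecutor`, `Flow`, `LocalProcessEnvironment`) is hypothetical and written only for illustration; none of this is Prefect's actual API.

```python
class SynchronousExecutor:
    """Hypothetical executor: decides how tasks inside one flow run execute."""

    def submit(self, fn, *args):
        # Run the task immediately in the current process.
        return fn(*args)


class Flow:
    """Hypothetical flow: just an ordered list of zero-argument tasks."""

    def __init__(self, tasks):
        self.tasks = list(tasks)

    def run(self, executor):
        # The flow hands every task to whichever executor it was given.
        return [executor.submit(task) for task in self.tasks]


class LocalProcessEnvironment:
    """Hypothetical environment: decides where the flow-run process starts."""

    def __init__(self, executor):
        self.executor = executor

    def execute(self, flow):
        # A real environment would start a Kubernetes job, container, etc.
        # This stand-in runs the flow in the current process instead.
        return flow.run(executor=self.executor)


flow = Flow([lambda: 1 + 1, lambda: "done"])
executor = SynchronousExecutor()

# Local run: only an executor is involved.
local_results = flow.run(executor=executor)

# Cloud-style run: the same executor, wrapped by an environment.
cloud_results = LocalProcessEnvironment(executor=executor).execute(flow)
```

In this sketch the environment never touches task execution; it only decides where `flow.run` is called, which is the separation proposed above.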

If people like this plan, I think a way forward would be:

Create an Executor class that manages a dask cluster lifetime (starting/stopping/scaling) using the standard dask cluster interface (similar to #2503, which I think is superseded by this issue). This might look like:
```
executor = DaskExecutor(
    cluster_class="dask_cloudprovider.FargateCluster",
    cluster_kwargs={
        "image": "my-fancy-prefect-image",
    },
    ...
)
...
```
The lifetime of the dask cluster created by this executor would be the lifetime of the flow run. If we needed to have specific classes for e.g. DaskKubernetesExecutor we could, but if possible I’d prefer to keep this generic (as specified in #2503).

Deprecate the existing Dask*Environment classes, with instructions for how to move to the new executor. For DaskKubernetesEnvironment, for example, an equivalent configuration might be:

```
environment = KubernetesJobEnvironment(
    executor_class=DaskExecutor,
    executor_kwargs=dict(
        cluster_class="dask_kubernetes.KubeCluster",
        cluster_kwargs={...}
    )
)
...
```
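
The lifetime rule from the first step (the cluster exists only for the duration of the flow run, and `cluster_class` may be given as a dotted string) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `TemporaryClusterExecutor`, `FakeCluster`, and `import_object` are hypothetical names, and `FakeCluster` stands in for a real dask cluster class so the example runs without dask installed.

```python
import importlib
from contextlib import contextmanager


def import_object(path):
    """Resolve a dotted path such as 'package.module.ClassName' to an object."""
    module_name, _, attr = path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


class FakeCluster:
    """Hypothetical stand-in for a dask cluster class."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.closed = False

    def close(self):
        self.closed = True


class TemporaryClusterExecutor:
    """Hypothetical executor that owns a cluster for a single flow run."""

    def __init__(self, cluster_class, cluster_kwargs=None):
        self.cluster_class = cluster_class
        self.cluster_kwargs = dict(cluster_kwargs or {})
        self.cluster = None

    @contextmanager
    def start(self):
        # Accept either a class object or a dotted import string.
        cls = self.cluster_class
        if isinstance(cls, str):
            cls = import_object(cls)
        self.cluster = cls(**self.cluster_kwargs)
        try:
            yield self
        finally:
            # Tear the cluster down when the flow run ends, even on error.
            self.cluster.close()


executor = TemporaryClusterExecutor(FakeCluster, {"image": "my-fancy-prefect-image"})
with executor.start():
    running_during_flow = not executor.cluster.closed
closed_after_flow = executor.cluster.closed
```

Keeping the cluster class as a plain parameter is what would let one generic executor cover Kubernetes, Fargate, or any other backend without a dedicated subclass for each.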

Thoughts?

Issue Analytics

State:
Created 3 years ago
Reactions: 1
Comments: 8

Top GitHub Comments

4 reactions

joeschmid commented, May 7, 2020

@jcrist The decoupling of these makes sense. It would also help to keep parity between Core and Cloud execution options, e.g. right now Dask Cloud Provider is an option for an Environment with Cloud but not as an Executor with Core. I really like this idea. (I agree that this issue should supersede https://github.com/PrefectHQ/prefect/issues/2503. I’ll go comment there now, but for me it’s ok to close that one in favor of this.)

2 reactions

jcrist commented, May 28, 2020

> Create an Executor class that manages a dask cluster lifetime (starting/stopping/scaling) using the standard dask cluster interface

This is done in #2667.
