Support cancellation of active flow runs

Currently flow runs can be transitioned into a Cancelled state, but that only prevents new tasks in the flow run from starting - it does nothing to stop already running tasks (it’s also not exposed nicely in the UI).

In practice this is hard to do effectively in all situations, because there is no straightforward way to interrupt an arbitrary Python function in all contexts. If running in process, we can interrupt the main thread, but not other threads. If running in a subprocess, we can kill the process. If running in a dask cluster, we can cancel pending work, but not stop running work. However, with #2667 the DaskExecutor can now manage the lifetime of a dask cluster, so we could have the executor shut down the cluster as a hard stop.
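To make the subprocess case concrete, here is a minimal sketch using only the Python standard library. The helper `run_cancellable` and its parameters are hypothetical and not part of Prefect; it only illustrates that work running in a child process can be hard-stopped from the outside, which is not possible for an arbitrary function running in a secondary thread.

```python
import subprocess
import sys
import threading


def run_cancellable(code: str, cancel_event: threading.Event,
                    poll_interval: float = 0.05) -> str:
    """Run `code` in a child Python process and kill it if cancellation is requested.

    Hypothetical helper for illustration only (not a Prefect API).
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    while True:
        try:
            # Wait briefly for the child to exit on its own.
            proc.wait(timeout=poll_interval)
            return "finished"
        except subprocess.TimeoutExpired:
            # Child is still running: check whether a cancel was requested.
            if cancel_event.is_set():
                proc.kill()   # hard stop; the child gets no chance to clean up
                proc.wait()   # reap the process
                return "cancelled"


if __name__ == "__main__":
    cancel = threading.Event()
    cancel.set()  # pretend the user already asked for cancellation
    print(run_cancellable("import time; time.sleep(30)", cancel))
```

The hard kill is what makes this reliable, and also why it gives the interrupted code no opportunity to release resources.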

Given the above restrictions, it seems we could add more support for cancelling flow runs with the following semantics:

Cancellation of a flow run starts when the flow run is transitioned into a Cancelling state, where Cancelling is a subclass of Running. This indicates that the flow run is in the process of cancelling, but is still technically Running. Transitioning directly into Cancelled from Running skips a necessary intermediate state and should be deprecated.
Cancellation is still “best effort”:
- If a task is started, we’ll cancel it if possible, otherwise it may run until termination
- If a task hasn’t started, it won’t be started
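The state relationships described above can be sketched as a small class hierarchy. This is a simplified, standalone model and not Prefect's actual state module; the method names `is_running` and `is_finished` and the helper `may_start_task` are assumptions made for illustration.

```python
class State:
    """Base class for all states in this simplified model."""

    def is_running(self) -> bool:
        return isinstance(self, Running)

    def is_finished(self) -> bool:
        return isinstance(self, Finished)


class Running(State):
    """The flow run (or task run) is executing."""


class Cancelling(Running):
    """Cancellation was requested; the run still counts as Running."""


class Finished(State):
    """Terminal states."""


class Success(Finished):
    pass


class Failed(Finished):
    pass


class Cancelled(Finished):
    """The run was cancelled; Cancelled counts as Finished."""


def may_start_task(flow_run_state: State) -> bool:
    """Hypothetical check: a task that hasn't started is only started
    while the flow run is Running and not already Cancelling."""
    return flow_run_state.is_running() and not isinstance(flow_run_state, Cancelling)
```

Because Cancelling subclasses Running, any existing check for "is this run still running?" keeps returning true during cancellation, while code that cares about cancellation can test for the more specific state.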

Focussing on how this works with orchestration, the following takes place after a user cancels a running flow:

The flow run is moved to Cancelling
Once in Cancelling, TaskRunner interactions with cloud/server (either heartbeats or state updates) will let the task runner know that the task should be cancelled. If possible, a running task will be interrupted, otherwise it will be allowed to finish. Tasks that are allowed to finish will transition to their final state as if cancellation hadn’t occurred (e.g. Success or Failed). This provides a more accurate representation of the cancelled flow run. Tasks that are successfully cancelled are marked as Cancelled.
Likewise, FlowRunner interactions with cloud (either heartbeats or state updates) will let the flow runner know that the flow should be cancelled. Any tasks that haven’t been submitted yet will transition directly to Cancelled.
- When the FlowRunner notices that a flow run has been cancelled, it will call a new method on the Executor to ask the executor to stop any ongoing work if possible. Tentatively I’ll call this method executor.cancel. For the DaskExecutor, if the executor is managing a temporary cluster (newly enabled in #2667), the cluster will be shut down, effectively cancelling any ongoing tasks. Otherwise a warning will be logged (that running tasks can’t be stopped) and any ongoing work will be allowed to complete. For the local executor we might try to interrupt the main thread, or we may just wait for the task to finish. As stated above, cancellation is best-effort, and doesn’t guarantee that ongoing work will be interrupted.
Once all tasks in the flow have transitioned to a Finished state (Cancelled being a subclass of Finished), the flow run will transition from Cancelling to Cancelled.
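The transitions listed above can be summarised as two small pure functions. This is only a sketch of the proposed semantics under stated assumptions: states are modelled as plain strings, and the names `task_state_after_cancel` and `flow_run_state` are hypothetical, not Prefect APIs.

```python
from typing import Iterable

# States that count as Finished in the proposal (Cancelled included).
FINISHED = {"Success", "Failed", "Cancelled"}


def task_state_after_cancel(current: str, interruptible: bool,
                            natural_outcome: str = "Success") -> str:
    """Resolve a task's eventual state once its flow run is Cancelling."""
    if current == "Pending":
        # Not submitted yet: goes directly to Cancelled.
        return "Cancelled"
    if current == "Running":
        # Best effort: interrupt if possible, otherwise let it finish
        # with whatever state it would have reached anyway.
        return "Cancelled" if interruptible else natural_outcome
    # Already finished tasks keep their state.
    return current


def flow_run_state(task_states: Iterable[str]) -> str:
    """The flow run stays Cancelling until every task is Finished."""
    if all(state in FINISHED for state in task_states):
        return "Cancelled"
    return "Cancelling"


if __name__ == "__main__":
    before = ["Success", "Running", "Pending"]
    print(flow_run_state(before))  # still waiting on unfinished tasks
    after = [
        task_state_after_cancel("Success", interruptible=False),
        task_state_after_cancel("Running", interruptible=False, natural_outcome="Failed"),
        task_state_after_cancel("Pending", interruptible=False),
    ]
    print(after, flow_run_state(after))
```

Note that a cancelled flow run can contain a mix of Success, Failed, and Cancelled tasks, which is the "more accurate representation" the proposal refers to.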

Issue Analytics

Created 3 years ago
Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction

jcrist commented, Jun 12, 2020

Good question. I think that cancelling a flow run shouldn’t hard-kill the FlowRunner - I think we want a more graceful shutdown. IMO the flow runner should:

Stop submitting new work
Ask the Executor to cancel the whole execution if possible

In the case of a temporary dask cluster, this would shut down the dask cluster. When the flow runner exits, the initial job (e.g. a Fargate task) would exit. Hard killing the FlowRunner may not allow sufficient time to clean up resources and could leave things in a bad state, so I think this should be handled at the Executor level.
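The two steps above (stop submitting new work, then ask the executor to cancel) can be illustrated with the standard library's `ThreadPoolExecutor` standing in for a Prefect Executor. This is an analogy under stated assumptions, not the proposed `executor.cancel` implementation: `cancel_flow_gracefully` is a hypothetical function, and `shutdown(cancel_futures=True)` requires Python 3.9 or later. Like the dask case, it can cancel queued work but cannot stop a task that is already running.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def cancel_flow_gracefully() -> list:
    """Submit three tasks to a one-worker pool, then cancel mid-run."""
    started = threading.Event()
    release = threading.Event()

    def long_task() -> str:
        started.set()             # signal that this task is now running
        release.wait(timeout=5)   # simulate work that cannot be interrupted
        return "done"

    def quick_task() -> str:
        return "done"

    pool = ThreadPoolExecutor(max_workers=1)
    futures = [pool.submit(long_task), pool.submit(quick_task), pool.submit(quick_task)]
    started.wait(timeout=5)  # make sure the first task has actually started

    # Cancellation requested: refuse new submissions and cancel queued work.
    # The running task is not interrupted.
    pool.shutdown(wait=False, cancel_futures=True)

    release.set()                  # let the running task finish naturally
    futures[0].result(timeout=5)   # wait for it before reporting states

    return ["Cancelled" if f.cancelled() else "Success" for f in futures]


if __name__ == "__main__":
    print(cancel_flow_gracefully())
```

Because the runner waits for in-flight work instead of being hard-killed, it still has a chance to release resources before exiting.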

1 reaction

dylanbhughes commented, Jun 12, 2020

@jcrist For Environments/Executors using AWS Fargate or Kubernetes Jobs, do you think this API would also be able to stop Flow Runs by shutting down the underlying infrastructure? It seems to be a common feature request. See: #2766