Support cancellation of active flow runs

Currently, flow runs can be transitioned into a Cancelled state, but that only prevents new tasks in the flow run from starting. It does nothing to stop tasks that are already running (it’s also not exposed nicely in the UI).

In practice this is hard to do effectively in all situations: there is no straightforward way to interrupt an arbitrary Python function in all contexts. If running in process, we can interrupt the main thread, but not other threads. If running in a subprocess, we can kill the process. If running in a Dask cluster, we can cancel pending work but not stop running work. However, with #2667 the DaskExecutor can now manage the lifetime of a Dask cluster, so we could have the executor shut down the cluster as a hard stop.
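
To make the subprocess case concrete, here is a minimal sketch using only the Python standard library. The helper name `hard_stop` and the grace period are assumptions made for illustration; this is not Prefect code, only one way a task running in a child process could be stopped from the outside.

```python
import subprocess
import sys


def hard_stop(proc, grace=5.0):
    """Ask a child process to terminate, escalating to kill() if it lingers.

    `hard_stop` is a hypothetical helper, not part of any library.
    """
    proc.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows
    try:
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # force the process down if it ignored the request
        proc.wait()
    return proc.returncode


# Stand-in for a long-running task executing in its own process.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = hard_stop(child)
```

The same approach does not carry over to a task running in a non-main thread of the current process, which is why the in-process case is harder.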

Given the above restrictions, it seems we could add more support for cancelling flow runs with the following semantics:

  • Cancellation of a flow run starts when the flow run is transitioned into a Cancelling state, where Cancelling is a subclass of Running. This indicates that the flow run is in the process of cancelling, but is still technically Running. Transitioning directly into Cancelled from Running skips a necessary intermediate state and should be deprecated.
  • Cancellation is still “best effort”:
    • If a task has started, we’ll cancel it if possible; otherwise it may run until termination
    • If a task hasn’t started, it won’t be started
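
The state relationships proposed above can be sketched as a small class hierarchy. The classes below are simplified stand-ins written for illustration, not Prefect's actual state classes; only the subclass relationships (Cancelling under Running, Cancelled under Finished) come from the proposal.

```python
class State:
    """Simplified stand-in for a run state (illustration only)."""

    def is_running(self):
        return isinstance(self, Running)

    def is_finished(self):
        return isinstance(self, Finished)


class Running(State):
    pass


class Cancelling(Running):
    """Cancellation requested; the run still counts as Running."""


class Finished(State):
    pass


class Success(Finished):
    pass


class Failed(Finished):
    pass


class Cancelled(Finished):
    """Terminal state reached once cancellation has completed."""


in_progress = Cancelling()
done = Cancelled()
```

With this layout, any existing check of the form "is the run still running?" keeps returning true while a run is in Cancelling, and only turns false once the run reaches Cancelled.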

Focussing on how this works with orchestration, the following takes place after a user cancels a running flow:

  • The flow run is moved to Cancelling
  • Once in Cancelling, TaskRunner interactions with cloud/server (either heartbeats or state updates) will let the task runner know that the task should be cancelled. If possible, a running task will be interrupted, otherwise it will be allowed to finish. Tasks that are allowed to finish will transition to their final state as if cancellation hadn’t occurred (e.g. Success or Failed). This provides a more accurate representation of the cancelled flow run. Tasks that are successfully cancelled are marked as Cancelled.
  • Likewise, FlowRunner interactions with cloud (either heartbeats or state updates) will let the flow runner know that the flow should be cancelled. Any tasks that haven’t been submitted yet will transition directly to Cancelled.
    • When the FlowRunner notices that a flow run has been cancelled, it will call a new method on the Executor to ask the executor to stop any ongoing work if possible. Tentatively I’ll call this method executor.cancel. For the DaskExecutor, if the executor is managing a temporary cluster (newly enabled in #2667), the cluster will be shut down, effectively cancelling any ongoing tasks. Otherwise a warning will be output (that pending tasks can’t be stopped) and any ongoing work will be allowed to complete. For the local executor we might try to interrupt the main thread, or we may just wait for the task to finish. As stated above, cancellation is best-effort, and doesn’t guarantee that ongoing work will be interrupted.
  • Once all tasks in the flow have transitioned to a Finished state (Cancelled being a subclass of Finished), the flow run will transition from Cancelling to Cancelled.
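
The sequence above can be walked through with a small simulation. Everything in it is hypothetical and exists only to illustrate the proposed semantics: the `State` enum, the function `cancel_flow_run`, and its `interruptible` and `natural_outcome` parameters are not part of Prefect.

```python
from enum import Enum


class State(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    CANCELLING = "Cancelling"
    SUCCESS = "Success"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


FINISHED = {State.SUCCESS, State.FAILED, State.CANCELLED}


def cancel_flow_run(tasks, interruptible, natural_outcome):
    """Resolve task and flow states after a cancellation request.

    tasks: task name -> state at the moment cancellation is requested
    interruptible: names of running tasks the executor is able to stop
    natural_outcome: final state of running tasks that cannot be stopped
    """
    flow_state = State.CANCELLING  # the flow run first moves to Cancelling
    final = {}
    for name, state in tasks.items():
        if state is State.PENDING:
            final[name] = State.CANCELLED  # never submitted
        elif state is State.RUNNING and name in interruptible:
            final[name] = State.CANCELLED  # interrupted successfully
        elif state is State.RUNNING:
            final[name] = natural_outcome[name]  # allowed to finish
        else:
            final[name] = state  # already finished, left untouched
    if all(s in FINISHED for s in final.values()):
        flow_state = State.CANCELLED  # every task has reached a Finished state
    return flow_state, final


flow_state, final = cancel_flow_run(
    tasks={
        "extract": State.SUCCESS,
        "transform": State.RUNNING,
        "train": State.RUNNING,
        "load": State.PENDING,
    },
    interruptible={"train"},
    natural_outcome={"transform": State.SUCCESS},
)
```

In this example the flow run ends in Cancelled, while "transform" keeps its natural Success state because it could not be interrupted; "train" and "load" end in Cancelled.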

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
jcrist commented, Jun 12, 2020

Good question. I think that cancelling a flow run shouldn’t hard-kill the FlowRunner - I think we want a more graceful shutdown. IMO the flow runner should:

  • Stop submitting new work
  • Ask the Executor to cancel the whole execution if possible

In the case of a temporary Dask cluster, this would shut down the Dask cluster. When the flow runner exits, the initial job (e.g. a Fargate task) would exit. Hard-killing the FlowRunner may not allow sufficient time to clean up resources, and could leave things in a bad state. I think this should be handled at the Executor level.
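
As a rough analogy for "stop submitting new work, then ask the executor to cancel", the standard library's `ThreadPoolExecutor` behaves in a similar best-effort way: `shutdown(cancel_futures=True)` (Python 3.9+) drops work that has not started but lets running work finish. This sketch uses `concurrent.futures` only; it is not the Prefect Executor API, and the task functions are made up for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

started = threading.Event()
release = threading.Event()


def long_task():
    started.set()            # signal that this task is now running
    release.wait(timeout=5)  # simulate ongoing work
    return "finished"


def quick_task(n):
    return n


pool = ThreadPoolExecutor(max_workers=1)
running = pool.submit(long_task)
pending = [pool.submit(quick_task, n) for n in range(3)]  # queued behind long_task

started.wait(timeout=5)  # make sure long_task has actually started
# Graceful stop: cancel queued work, refuse new submissions, keep running work.
pool.shutdown(wait=False, cancel_futures=True)
release.set()            # let the running task complete on its own

running_result = running.result(timeout=5)
pending_cancelled = [f.cancelled() for f in pending]
```

The running task still completes normally, while the queued tasks are cancelled before they ever start, which mirrors the best-effort semantics described in the issue.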

1 reaction
dylanbhughes commented, Jun 12, 2020

@jcrist For Environments/Executors using AWS Fargate or Kubernetes Jobs, do you think this API would also be able to stop Flow Runs by shutting down the underlying infrastructure? It seems to be a common feature request. See: #2766
