
[BUG] Subworkflow recover mode is broken


Describe the bug

When recovering this nested_parent_wf, the second t1 was incorrectly recovered with the subworkflow outputs, not the task outputs:

[Screenshot, 2022-09-01 2:05:12 PM: the second t1 recovered with the subworkflow outputs]

The expected outputs are c and t1_int_output:

[Screenshot, 2022-09-01 2:05:56 PM: expected t1 outputs]

These correspond to the t1 task signature, not to the subworkflow signature (https://github.com/flyteorg/flytesnacks/blob/1b86bdd84cb55ad65c9c97256d91da0c1a3e2610/cookbook/core/control_flow/subworkflows.py#L67) that the incorrectly recovered t1 shows.

Expected behavior

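Recovering a workflow that contains subworkflows should restore each node with its own outputs. In this example, the recovered t1 should report c and t1_int_output.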

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn’t been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
kumare3 commented, Sep 3, 2022

cc @hamersaw

@katrogan this is a correctness issue. Can we get on this?

0 reactions
hamersaw commented, Sep 7, 2022

OK, this is actually pretty complex, and I can confirm dynamic tasks are broken too, maybe worse. As explained above, when we attempt to recover the node execution we use the NodeID from the NodeExecutionMetadata. This means the ID of a node with a parent does not include the parent info, which differs from how the node ID is set during eventing. As a result, recovery of the subworkflow node “n1” searches for “n1” under the top-level workflow execution ID rather than for “n0-0-n1” as it should. This is relatively easy to fix, but unfortunately the fix doesn't cover all cases.
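The ID mismatch can be made concrete with a minimal Go sketch. It assumes, as the “n0-0-n1” example suggests, that a nested node's fully qualified ID is the parent node ID, the parent's retry attempt, and the node ID joined by hyphens. parentInfo and fullyQualifiedNodeID are hypothetical names, not flytepropeller's API, and the real implementation may build or encode IDs differently.

```go
package main

import "fmt"

// parentInfo is a hypothetical stand-in for the context a nested node carries:
// its parent's node ID and the parent's current retry attempt.
type parentInfo struct {
	NodeID  string
	Attempt uint32
}

// fullyQualifiedNodeID joins parent ID, parent attempt and node ID with "-".
// A nil parent means a top-level node, whose ID is used unchanged.
func fullyQualifiedNodeID(parent *parentInfo, nodeID string) string {
	if parent == nil {
		return nodeID
	}
	return fmt.Sprintf("%s-%d-%s", parent.NodeID, parent.Attempt, nodeID)
}

func main() {
	subworkflow := &parentInfo{NodeID: "n0", Attempt: 0}

	// Key used during recovery: the bare NodeID, as if n1 were a top-level node.
	fmt.Println(fullyQualifiedNodeID(nil, "n1")) // n1

	// Key recorded during eventing for n1 inside subworkflow node n0.
	fmt.Println(fullyQualifiedNodeID(subworkflow, "n1")) // n0-0-n1
}
```

The quick fix amounts to making recovery use the second form.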

The real problem is dynamic tasks, because the parent info may include a retry attempt (which for subworkflows is always 0). If the dynamic task's parent task fails (incrementing the retry attempt to 1), then as subnodes execute, their fully qualified IDs are prefixed with “n0-1” for the parent node ID and retry attempt. So in this example the ID would be “n0-1-n1” rather than “n0-0-n1”. If we attempt to recover this dynamic task, we will search for “n0-0” because recovery doesn't know whether the parent was retried (handling that may require an additional lookup in flyteadmin, and it gets much more complex).
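The retry problem can be modeled the same way. This is a simplified Go model, not flytepropeller code. recordedOutputs stands in for node executions stored by the earlier run, and lastParentAttempt stands in for the additional lookup the comment suggests might be needed in flyteadmin. All names are hypothetical. A lookup that assumes attempt 0 misses n1, so the node would be recomputed. A lookup that first resolves the parent's final attempt finds it.

```go
package main

import "fmt"

// recordedOutputs is a hypothetical stand-in for node executions stored by the
// previous run, keyed by fully qualified node ID. The dynamic parent n0 failed
// once, so its subnode n1 was recorded under attempt 1.
var recordedOutputs = map[string]string{
	"n0-1-n1": "outputs of n1 from the parent's second attempt",
}

// lastParentAttempt is a hypothetical stand-in for an extra lookup that
// returns the retry attempt on which each parent node finished.
var lastParentAttempt = map[string]uint32{"n0": 1}

// key builds "<parent>-<attempt>-<node>".
func key(parent string, attempt uint32, node string) string {
	return fmt.Sprintf("%s-%d-%s", parent, attempt, node)
}

// recoverAssumingFirstAttempt always assumes the parent was never retried.
func recoverAssumingFirstAttempt(parent, node string) (string, bool) {
	out, ok := recordedOutputs[key(parent, 0, node)]
	return out, ok
}

// recoverWithAttemptLookup resolves the parent's final attempt before the lookup.
func recoverWithAttemptLookup(parent, node string) (string, bool) {
	attempt, ok := lastParentAttempt[parent]
	if !ok {
		return "", false
	}
	out, ok := recordedOutputs[key(parent, attempt, node)]
	return out, ok
}

func main() {
	if _, ok := recoverAssumingFirstAttempt("n0", "n1"); !ok {
		fmt.Println("attempt-0 guess: not found, n1 would be recomputed")
	}
	if out, ok := recoverWithAttemptLookup("n0", "n1"); ok {
		fmt.Println("with attempt lookup:", out)
	}
}
```

The comment describes this trade-off. The attempt-0 guess never restores the wrong outputs but can redo completed work. Resolving the real attempt avoids that at the cost of extra state to query.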

TL;DR: the quick fix is relatively easy and will not result in incorrect behavior, but in corner cases it will recompute previously completed tasks. The correct fix is probably a bit more complex. cc @katrogan thoughts?
