Investigate reports of failed mapped tasks returning None to downstream tasks
Description
I’ve heard from a contributor about an unstable mapping behavior. As I heard it, the report was:
- in a mapped pipeline
- a Dask worker unexpectedly dies
- the downstream task unexpectedly runs and receives `None` as input, causing a runtime error from the unexpected input (a sketch of that failure follows this list)
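For illustration only, here is a minimal, hypothetical sketch (not code from the report) of the kind of runtime error a downstream "reduce" step hits when one of its mapped upstream results arrives as `None`:

```python
# Hypothetical illustration of the reported symptom: a downstream task
# receives None where a mapped child's result should be.
mapped_results = [1, 2, None, 4]  # None stands in for a child lost when a worker died


def reduce_results(values):
    # Raises: TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
    return sum(values)


reduce_results(mapped_results)
```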
I also found reports of this in our Slack history (archived here: https://github.com/PrefectHQ/prefect/issues/2655) that suggested a link to specific deployment environments and to high-volume mapped pipelines.
Note: is this possibly related to https://github.com/PrefectHQ/prefect/issues/2430?
Expected Behavior
What did you expect to happen instead? The upstream mapped task should be marked Failed, and the downstream mapped task should not run.
Reproduction
A minimal example that exhibits the behavior.
I have not observed it myself yet, but based on the Slack thread, a high-volume mapped task running on an unstable network with DaskKubernetesEnvironment seems to be the best way to reproduce it.
Environment
Any additional information about your environment.
Optionally run `prefect diagnostics` from the command line and paste the information here.
Without a reproducible example, I’m not sure how to progress on this, especially since it may have been resolved by the mapping refactor. +0.5 on closing if others are ok with it, since we don’t have an immediate action plan or reproducer.
Good news everyone! I have a reproducible example of this behavior. @jcrist it’s for your favorite part of the codebase - results! It’s specific to the following situation:
It appears that the data produced by the successfully completed mapped children prior to the zombie-death is not properly rehydrated on the other end when the process is resurrected for a retry.
Here’s the flow I used locally to test:
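A minimal sketch of such a flow, assuming the Prefect 0.x API (`task`, `Flow`, `.map`, `max_retries`); this is an illustrative reconstruction, not the author’s exact code:

```python
import time
from datetime import timedelta

import prefect
from prefect import task, Flow


@task
def generate_numbers():
    return list(range(10))


@task(max_retries=1, retry_delay=timedelta(seconds=10))
def slow_increment(x):
    # Emit a "waiting" log, then sleep long enough to kill the flow-runner
    # and heartbeat processes for the task mid-run.
    prefect.context.get("logger").info("waiting...")
    time.sleep(120)
    return x + 1


@task
def reduce_all(values):
    # Fails if any mapped upstream result is rehydrated as None after the retry.
    return sum(values)


with Flow("mapped-zombie-repro") as flow:
    numbers = generate_numbers()
    incremented = slow_increment.map(numbers)
    total = reduce_all(incremented)

# In the described repro, the flow was run against Prefect Cloud so the zombie
# killer would reschedule the killed mapped child for a retry once its heartbeat stopped.
```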
Whenever I saw the waiting log I killed both the flow runner process as well as the heartbeat process for the task. After waiting for Cloud to do its thing, I then saw:
It appears that our `load_results` logic doesn’t quite work whenever the immediate upstream was a mapped task. I can resolve tomorrow 👍