Investigate reports of failed mapped tasks returning None to downstream tasks

Description

A contributor reported unstable mapping behavior. As it was described to me:

  • in a mapped pipeline
  • a dask worker unexpectedly dies
  • the downstream task unexpectedly runs and receives ‘None’ as input, causing a runtime error because of the weird input

I also found reports of this in our Slack history (archived here: https://github.com/PrefectHQ/prefect/issues/2655) that suggested a link to specific deployment environments and to high-volume mapped pipelines.

Note: is this possibly related to https://github.com/PrefectHQ/prefect/issues/2430?

Expected Behavior

What did you expect to happen instead? The upstream mapped task is Failed, and the downstream mapped task does not run.
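
The expected behavior can be sketched in plain Python, independent of Prefect's actual state classes. The names `ChildState` and `should_run_downstream` are hypothetical and exist only for this illustration; the assumption is simply that the downstream task runs only when every upstream mapped child succeeded.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class ChildState:
    """Hypothetical stand-in for the final state of one mapped child."""
    successful: bool
    result: Any = None


def should_run_downstream(children: List[ChildState]) -> bool:
    """Return True only when every upstream mapped child succeeded."""
    return all(child.successful for child in children)


# One child failed (for example, its worker died), so the downstream step is skipped
# instead of running with a missing input.
states = [ChildState(True, 0), ChildState(False), ChildState(True, 2)]
if should_run_downstream(states):
    print([child.result for child in states])
else:
    print("upstream failed; downstream does not run")
```

The reported behavior is the opposite: the downstream task runs anyway and sees `None` where a child's result should be.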

Reproduction

A minimal example that exhibits the behavior. I have not observed it myself yet, but based on the Slack thread, a high-volume mapped task on an unstable network using DaskKubernetesEnvironment seems to be the best way to reproduce it.

Environment

Any additional information about your environment

Optionally run prefect diagnostics from the command line and paste the information here

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (9 by maintainers)

Top GitHub Comments

2 reactions
jcrist commented, Jun 16, 2020

Without a reproducible example, I’m not sure how to progress on this, especially since it may have been resolved by the mapping refactor. +0.5 on closing if others are ok with it, since we don’t have an immediate action plan or reproducer.

1 reaction
cicdw commented, Jul 10, 2020

Good news everyone! I have a reproducible example of this behavior. @jcrist it’s for your favorite part of the codebase - results! It’s specific to the following situation:

  • zombies occur mid-way through a mapped pipeline on tasks that have retries
  • there is a reduce task immediately after the zombie-level

It appears that all data that was produced by the successfully mapped children prior to the zombie-death is not properly rehydrated on the other end whenever the process is resurrected for a retry.

Here’s the flow I used locally to test:

import prefect
from prefect import task, Flow

from datetime import timedelta
import os
import time


@task
def return_list():
    # Log the PID of the process running this task.
    prefect.context['logger'].debug(f'PID: {os.getpid()}')
    return list(range(10))


# The mapped task has retries, matching the situation described above.
@task(max_retries=2, retry_delay=timedelta(seconds=0))
def map_task(x):
    if x == 5:
        # Pause on one child to leave a window for killing the processes.
        prefect.context['logger'].critical('Waiting: do it! do it!')
        time.sleep(20)
    return x


@task
def reducer(ll):
    # Log each index alongside the value received from the mapped upstream.
    msg = '\n'.join("{i}: {v}".format(i=i, v=v) for i, v in enumerate(ll))
    prefect.context['logger'].debug(msg)


with Flow("zombie") as flow:
    reducer(map_task.map(return_list))

As soon as I saw the waiting log, I killed both the flow runner process and the heartbeat process for the task. After waiting for Cloud to do its thing, I then saw:

[2020-07-10 03:26:31] 807-- DEBUG - prefect.CloudTaskRunner | Task 'reducer': Calling task.run() method...
[2020-07-10 03:26:31] 29-- DEBUG - prefect.reducer | 0: None
1: None
2: None
3: None
4: None
5: None
6: None
7: 7
8: 8
9: 9

It appears that our load_results logic doesn’t quite work whenever the immediate upstream was a mapped task. I can resolve tomorrow 👍
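
To make the symptom concrete, here is a plain-Python sketch of what rehydrating mapped children could look like. It is not Prefect's `load_results` implementation: `STORE`, `ChildState`, and `load_child_results` are hypothetical names, and the sketch assumes each mapped child persisted its value at a known location that survives a restart while the in-memory value does not.

```python
from typing import Any, Dict, List, Optional

# Hypothetical persisted store: result location -> value written by each mapped child.
STORE: Dict[str, Any] = {f"results/map_task-{i}": i for i in range(10)}


class ChildState:
    """Hypothetical mapped-child state: the location survives a restart, the value may not."""

    def __init__(self, location: str, value: Optional[Any] = None) -> None:
        self.location = location
        self.value = value


def load_child_results(children: List[ChildState]) -> List[Any]:
    """Re-read any child value that is missing in memory from its stored location."""
    loaded = []
    for child in children:
        if child.value is None:  # assumption: None means "not loaded", not a real result
            child.value = STORE[child.location]
        loaded.append(child.value)
    return loaded


# Mirror the reported output: children 0-6 have no in-memory value after the restart.
children = [
    ChildState(f"results/map_task-{i}", value=i if i >= 7 else None)
    for i in range(10)
]
print([child.value for child in children])  # what the reducer received in the report
print(load_child_results(children))         # what it should receive once rehydrated
```

The point of the sketch is that the load step has to visit every mapped child individually; loading only the parent mapped state would leave the children's values empty. Treating `None` as "not loaded" is a simplification, since a task can legitimately return `None`.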
