"Failed to load task run error" occur in "on_failure_callback" after upgrading from 2.0.0 to 2.1.2

Apache Airflow version: 2.1.2

Kubernetes version (if you are using kubernetes) (use kubectl version): “v1.19.6”

Environment:

  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened: “Failed to load task run error” appears in “on_failure_callback” after a task fails

What you expected to happen: The actual exception should be passed to the callback

How to reproduce it:

from datetime import datetime


def validate_parameters(**kwargs):
    conf = kwargs.get('dag_run').conf
    kwargs['ti'].xcom_push(key='conf', value=conf)
    job_identifier = conf.get('job_identifier')
    s3_connection_id = conf.get('s3_connection_id')
    if not job_identifier:
        raise ParameterValidationException(reason='job_identifier should not be empty')
    if not s3_connection_id:
        raise ParameterValidationException(reason='s3_connection_id should not be empty')
  

def handle_failure(context):
    exception = context.get('exception')
    # the ParameterValidationException should be available here, but I got "Failed to load task run error"
    # other business
    

DEFAULT_ARGS = {
    'owner': 'boss_admin',
    'start_date': datetime(2021, 1, 2),
    "retries": 0,
    'on_failure_callback': handle_failure
}

Anything else we need to know:

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

4 reactions
Calder-Ty commented, Aug 30, 2021

Bottom line up front: I had a similar issue and found that the cause was a third-party package whose exceptions could not be loaded back from a pickle file. This is the most likely cause of the problem.

For me, switching to Airflow’s official Helm chart wasn’t an option (I’m not running Kubernetes). I did some digging and found that this message is produced when Airflow fails to load the original exception from a pickle file. https://github.com/apache/airflow/blob/9922287a4f9f70b57635b04436ddc4cfca0e84d2/airflow/models/taskinstance.py#L136-L147

Unfortunately, the underlying error is not logged. I applied this patch to my Airflow instance to see which error was causing pickle.loads to fail.

@@ -115,7 +115,10 @@
         return None
     try:
         return pickle.loads(data)
-    except Exception:  # pylint: disable=broad-except
+    except Exception as e:  # pylint: disable=broad-except
+        log.exception("Failed to load the exception! Oh NO!!! Here is why")
+        log.exception(getattr(e, "msg", repr(e)))
+        log.exception(e)
         return "Failed to load task run error"
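
The effect of the patch can be tried outside Airflow. The following is a minimal standalone sketch, not Airflow's actual code: `load_pickled_error` and `FALLBACK_MESSAGE` are hypothetical names, and the function only mirrors the shape of the patched hunk (return `None` for empty data, log the cause when unpickling fails, then return the fallback string).

```python
import logging
import pickle

log = logging.getLogger(__name__)

# The string the failure callback ends up seeing instead of the exception.
FALLBACK_MESSAGE = "Failed to load task run error"


def load_pickled_error(data):
    """Hypothetical stand-in for the patched helper, not Airflow's real code."""
    if not data:
        return None
    try:
        return pickle.loads(data)
    except Exception:
        # log.exception attaches the active traceback, so the real cause
        # of the failed load is recorded before the fallback is returned.
        log.exception("Could not unpickle the task run error")
        return FALLBACK_MESSAGE


good = load_pickled_error(pickle.dumps(ValueError("boom")))
bad = load_pickled_error(b"not a pickle")
print(repr(good), repr(bad))
```

A single `log.exception` call is enough in this sketch because it already includes the exception type, message, and traceback.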

I found that the third-party tooling I was using raised an exception that could be pickled but not loaded back. This is a common issue with custom exceptions in Python. I’m guessing that something similar is happening in your case.
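
To illustrate how an exception can pickle successfully and still fail to load, here is a small sketch. The class body is an assumption, since the issue never shows how `ParameterValidationException` is defined; the keyword-only `reason` parameter and the `round_trip` helper are hypothetical. By default, an exception is rebuilt by calling its class with `self.args`, so a constructor that does not accept those positional arguments breaks the load step.

```python
import pickle


class ParameterValidationException(Exception):
    """Hypothetical reconstruction; the issue does not show this class."""

    def __init__(self, *, reason):
        # Only the formatted message reaches Exception.__init__, so
        # self.args holds one positional string, while this constructor
        # accepts keyword arguments only.
        super().__init__(f"Invalid parameter: {reason}")
        self.reason = reason


def round_trip(exc):
    """Pickle and unpickle exc; return the copy or the error raised on load."""
    data = pickle.dumps(exc)  # pickling itself succeeds
    try:
        return pickle.loads(data)
    except Exception as load_error:
        return load_error


result = round_trip(
    ParameterValidationException(reason="job_identifier should not be empty")
)
print(type(result).__name__, result)
```

Under these assumptions the round trip yields a `TypeError` rather than the original exception, which is the kind of failure that Airflow reports as "Failed to load task run error".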

1 reaction
kaojunsong commented, Apr 28, 2022

@Calder-Ty Thanks a lot. I fixed my issue by adding def __reduce__(self): return ParameterValidationException, (self.reason, self.component) to my custom exception, because the pickle library needs to unpickle the data from log files. https://www.synopsys.com/blogs/software-security/python-pickling/#:~:text=Whenever an object is pickled,reconstruct this object when unpickling.
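
The `__reduce__` fix described above might look like the sketch below. The constructor signature and message format are assumptions (the original class is not shown in the issue); only the `__reduce__` body follows the comment. It tells pickle to rebuild the exception by calling the class with `reason` and `component` instead of the default `self.args`.

```python
import pickle


class ParameterValidationException(Exception):
    """Hypothetical class; only __reduce__ is taken from the comment above."""

    def __init__(self, reason, component):
        # A single formatted message is passed up, so the default pickling
        # would try to call the class with one argument and fail on load.
        super().__init__(f"[{component}] {reason}")
        self.reason = reason
        self.component = component

    def __reduce__(self):
        # Rebuild with the same two arguments that __init__ expects.
        return ParameterValidationException, (self.reason, self.component)


original = ParameterValidationException(
    "job_identifier should not be empty", "validate_parameters"
)
restored = pickle.loads(pickle.dumps(original))
print(type(restored).__name__, restored.reason, restored.component)
```

Returning `type(self)` instead of the class name in `__reduce__` would also keep subclasses intact, at the cost of requiring every subclass to accept the same two arguments.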
