
Data Fusion Hook Start pipeline will succeed before pipeline is in RUNNING state


(Re-opening of #8672 with the appropriate template.) CC: @turbaszek

Apache Airflow version: master

Kubernetes version (if you are using kubernetes) (use kubectl version):

Environment: Composer (this is not unique to composer it’s a code logic issue)

  • Cloud provider or hardware configuration: gcp
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

What happened: The start_pipeline method of the Data Fusion hook succeeds as soon as it gets a 200 from the CDAP API. This is a misleading success signal: at best it indicates that the pipeline entered the PENDING state. start_pipeline should not succeed until the pipeline has reached the RUNNING state. Note that the happy path is PENDING > STARTING > RUNNING (ProgramStatus). Many CDAP pipelines using the Dataproc provisioner spend a significant amount of time in the STARTING state, because they also have to tick through the various ProgramRunClusterStatus values while provisioning the Dataproc cluster.

What you expected to happen: start_pipeline and the associated operator should not succeed until the pipeline has actually started (i.e., entered the RUNNING state).
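A minimal sketch of the desired wait logic. This is not the hook's actual API: `get_state` is a hypothetical callable returning the pipeline's current ProgramStatus string (e.g. from the CDAP runs endpoint), and the terminal-state set is an assumption based on the states mentioned in this issue.

```python
import time

SUCCESS_STATES = {"RUNNING"}
FAILURE_STATES = {"FAILED", "KILLED", "REJECTED"}


def wait_for_pipeline_running(get_state, timeout=600, poll_interval=10, sleep=time.sleep):
    """Poll until the pipeline reaches RUNNING, or raise on a terminal state/timeout."""
    waited = 0
    while waited <= timeout:
        state = get_state()
        if state in SUCCESS_STATES:
            return state
        if state in FAILURE_STATES:
            # The pipeline died before it ever ran; the operator should fail here,
            # not report success because the start call returned 200.
            raise RuntimeError(f"Pipeline reached terminal state {state} before RUNNING")
        sleep(poll_interval)
        waited += poll_interval
    raise TimeoutError(f"Pipeline did not reach RUNNING within {timeout}s")
```

With this in place, the happy path (PENDING > STARTING > RUNNING) succeeds, while a pipeline that is killed mid-provisioning fails the task instead of silently "succeeding".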

How to reproduce it: This is a code logic issue and is reproduced by any use of this method. To demonstrate why it is problematic:

  • Run a dag that runs CloudDataFusionStartPipelineOperator
  • Do any of the following:
    • Hop over to Data Fusion UI and manually stop the pipeline before it enters the running state
    • Manually delete the dataproc cluster before it finishes provisioning
    • Use a Compute Profile that tries to provision an illegal dataproc cluster (e.g. due to permissions issue where CDF SA doesn’t have sufficient permission to create cluster in another project)
  • Observe that the CloudDataFusionStartPipelineOperator task succeeds (even though the pipeline never actually started).

We could catch regressions like this in the future by adding a system test that runs CloudDataFusionStartPipelineOperator, immediately kills the pipeline via the stop endpoint, and asserts that the task fails.

Anything else we need to know: Unfortunately, the start call to CDAP does not return a run_id to poll for state.

This hook could work around this by adding a special runtime arg, e.g. __faux_airflow_id__, which can be used to "look up" the real run id. The value of this runtime arg could be the dag_run_id or similar. If this workaround is used, or if the CDAP API can return a run id, then a more useful operator than start pipeline would be one that actually waits until the job reaches a success state (much like the existing Dataflow and Dataproc operators).
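A sketch of the faux-id workaround described above. `start_pipeline` and `list_runs` are hypothetical stand-ins for calls against the CDAP REST API, and the run-record shape (a "properties" dict carrying the JSON-encoded runtime args) is an assumption, not a documented contract.

```python
import json

FAUX_ID_ARG = "__faux_airflow_id__"


def start_with_faux_id(start_pipeline, faux_id, runtime_args=None):
    """Inject the faux id into the runtime args so the run can be found later."""
    args = dict(runtime_args or {})
    args[FAUX_ID_ARG] = faux_id
    start_pipeline(runtime_args=args)
    return faux_id


def find_run_id_by_faux_id(list_runs, faux_id):
    """Scan the pipeline's runs for the one carrying our faux id."""
    for run in list_runs():
        run_args = json.loads(run.get("properties", {}).get("runtimeArgs", "{}"))
        if run_args.get(FAUX_ID_ARG) == faux_id:
            return run["runid"]
    return None
```

Once the real run id is recovered this way, the operator could poll that specific run's status rather than guessing which run it started.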

There is an example in Go, from a Terraform provider resource that manages a streaming pipeline: creating the pipeline with a faux id, and then looking up the real CDAP run id by that faux id. (The original issue linked to the relevant source; the links are not preserved here.)

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
jaketf commented, May 8, 2020

I will reach out to my team and see if there are folks interested in joining the Airflow community by patching this.

0 reactions
mik-laj commented, May 19, 2020

I added a good first issue label. We have a detailed description of the problem, and this only affects one operator, so this can be an interesting task for a beginner.


