LocalDagRunner vs AirflowDagRunner: how to set up artifact names?

System information

  • Linux
  • TensorFlow version: 2.8
  • TFX Version: 1.8
  • Python version: 3.8

Problem

I created a TFX pipeline and ran it both via LocalDagRunner().run(pipe) and via AirflowDagRunner(AirflowPipelineConfig(AIRFLOW_ARGS)).run().

I noticed that AirflowDagRunner saves artifacts in the metadata store with names like these:

name
post_transform_stats
pre_transform_stats
statistics

while LocalDagRunner saves artifacts in the metadata store with names like these (two separate runs shown):

name
ihw_spans:2022-06-28T14:54:32.196073:StatisticsGen:statistics:0
ihw_spans:2022-06-28T14:54:32.196073:Transform:post_transform_stats:0
ihw_spans:2022-06-28T14:54:32.196073:Transform:pre_transform_stats:0

ihw_spans:2022-06-28T15:10:55.220458:StatisticsGen:statistics:0
ihw_spans:2022-06-28T15:10:55.220458:Transform:post_transform_stats:0
ihw_spans:2022-06-28T15:10:55.220458:Transform:pre_transform_stats:0
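The LocalDagRunner names appear to follow a pipeline:run_id:component:output_key:index layout. That layout is inferred from the listing above, not taken from TFX documentation. A minimal sketch that splits such a name under that assumption (the names `ArtifactNameParts` and `split_artifact_name` are hypothetical):

```python
from typing import NamedTuple


class ArtifactNameParts(NamedTuple):
    pipeline: str
    run_id: str
    component: str
    output_key: str
    index: int


def split_artifact_name(name: str) -> ArtifactNameParts:
    """Split a LocalDagRunner-style artifact name into its assumed parts."""
    # The run id is a timestamp that itself contains colons, so peel the
    # pipeline name off the left and the last three fields off the right.
    pipeline, rest = name.split(":", 1)
    run_id, component, output_key, index = rest.rsplit(":", 3)
    return ArtifactNameParts(pipeline, run_id, component, output_key, int(index))


parts = split_artifact_name(
    "ihw_spans:2022-06-28T14:54:32.196073:StatisticsGen:statistics:0"
)
print(parts.run_id)      # the per-run timestamp that makes the name unique
print(parts.output_key)  # the only part AirflowDagRunner stored as the name
```

Only the output key survives in the Airflow-produced names, which is why they repeat from run to run.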

As a result, when I process a new data span via Airflow I get this error:

ml_metadata.errors.AlreadyExistsError: Given node already exists: type_id: 17
uri: "/home/larion/airflow/tfx/pipelines/ihw_spans/StatisticsGen/statistics/46"
custom_properties {
  key: "name"
  value {
    string_value: "statistics"
  }
}
custom_properties {
  key: "producer_component"
  value {
    string_value: "StatisticsGen"
  }
}
name: "statistics"

This happens because the metadata store cannot save the new span's statistics under a name that already exists. When I run the DAG locally a second or third time everything works fine, because the name includes the run timestamp and is therefore always unique.
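The failure can be reproduced in miniature without TFX. The sketch below is a toy stand-in for the metadata store, assuming (as the error suggests) that an artifact name must be unique within its type. `ToyStore`, `put_artifact`, and `all_runs_accepted` are hypothetical and are not the ml_metadata API.

```python
class AlreadyExistsError(Exception):
    """Stand-in for ml_metadata.errors.AlreadyExistsError."""


class ToyStore:
    """Toy store that rejects a repeated (type_id, name) pair."""

    def __init__(self) -> None:
        self._seen = set()

    def put_artifact(self, type_id: int, name: str) -> None:
        key = (type_id, name)
        if key in self._seen:
            raise AlreadyExistsError(f"Given node already exists: {key}")
        self._seen.add(key)


def all_runs_accepted(names_per_run) -> bool:
    """Store one artifact per run and report whether every run was accepted."""
    store = ToyStore()
    try:
        for name in names_per_run:
            store.put_artifact(type_id=17, name=name)
    except AlreadyExistsError:
        return False
    return True


# Airflow-style: the same bare name on every run.
airflow_ok = all_runs_accepted(["statistics", "statistics"])
# Local-style: the run timestamp is part of the name.
local_ok = all_runs_accepted([
    "ihw_spans:2022-06-28T14:54:32.196073:StatisticsGen:statistics:0",
    "ihw_spans:2022-06-28T15:10:55.220458:StatisticsGen:statistics:0",
])
print(airflow_ok, local_ok)
```

In this toy model the second Airflow-style write is rejected, while both local-style writes go through.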

Is it possible to configure the naming strategy for pipelines? Is it possible to set artifact names in the metadata.Artifact table when creating components?

I do not understand how I can trigger the TFX pipeline a second or third time via Airflow while this error occurs.

Issue Analytics

  • State:open
  • Created a year ago
  • Reactions:1
  • Comments:5 (5 by maintainers)

Top GitHub Comments

sanatmpa1 commented, Jul 14, 2022 (1 reaction)

@jiyongjung0, @1025KB

I noticed a similar issue raised on Stack Overflow yesterday. Please refer to this link. Thanks!

singhniraj08 commented, Nov 14, 2022 (0 reactions)

@Daard,

Could you please try the suggested solution by running the following command after installing TFX, and let us know if it works?

sed -i 's/artifact.name = name/artifact.name = f"{name}:{pipeline_info.run_id}"/' /opt/conda/lib/python3.7/site-packages/tfx/dsl/components/base/base_driver.py

Thank you!
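To see what the substitution in the comment above changes, the same replacement can be applied to a sample line in Python. This is only an illustration: the input line is taken from the sed pattern itself, not copied from base_driver.py, and `pipeline_info.run_id` is assumed to be in scope at that point in the file, as the suggested command implies. The helper `patch_line` is hypothetical.

```python
import re

# Pattern and replacement mirror the sed expression from the comment above
# (with the dot escaped so it matches a literal ".").
PATTERN = r"artifact\.name = name"
REPLACEMENT = 'artifact.name = f"{name}:{pipeline_info.run_id}"'


def patch_line(line: str) -> str:
    """Apply the suggested substitution to one line of source text."""
    # A function replacement keeps re.sub from interpreting the braces or
    # any backslashes in REPLACEMENT.
    return re.sub(PATTERN, lambda _match: REPLACEMENT, line)


before = "    artifact.name = name"  # assumed shape of the original line
after = patch_line(before)
print(after)
```

With this change each stored name would carry the run id, which is the property that keeps the LocalDagRunner names unique. Note that the path in the command targets python3.7 under /opt/conda, while the reporter lists Python 3.8, so the path would likely need adjusting to the actual installation.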
