Error in passing metadata to DataprocClusterCreateOperator

Hi, I am having trouble installing pip packages on a Dataproc cluster with an initialization script. I am trying to upgrade from Airflow 1.10.12, where this code works fine, to Airflow 2.0.

[2021-07-09 11:35:37,587] {taskinstance.py:1454} ERROR - metadata was invalid: [('PIP_PACKAGES', 'pyyaml requests pandas openpyxl'), ('x-goog-api-client', 'gl-python/3.7.10 grpc/1.35.0 gax/1.26.0 gccl/airflow_v2.0.0+astro.3')

path = f"gs://goog-dataproc-initialization-actions-{self.cfg.get('region')}/python/pip-install.sh"

return DataprocClusterCreateOperator(
    ........
    init_actions_uris=[path],
    metadata=[('PIP_PACKAGES', 'pyyaml requests pandas openpyxl')],
    ............
)

Apache Airflow version: airflow_v2.0.0

What happened: I am trying to migrate our codebase from Airflow v1.10.12. On deeper analysis, I found that, as part of the refactoring in PR #6371, we can no longer pass metadata to DataprocClusterCreateOperator(), because it is not passed on to ClusterGenerator().

What you expected to happen: Operator should work as before.
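
A possible reading of the error above, offered as an assumption rather than something confirmed in this issue: in Airflow 2.0 the operator's metadata argument appears to be forwarded as gRPC call metadata (note the x-goog-api-client entry sitting next to PIP_PACKAGES in the log), and gRPC only accepts keys made of lowercase letters, digits, "_", "-" and ".". The sketch below checks keys against that rule. The helper name invalid_grpc_keys is hypothetical and is not part of Airflow or gRPC.

```python
import re

# Assumption: gRPC metadata keys may only contain lowercase ASCII letters,
# digits, "_", "-" and ".". Uppercase letters make a key invalid.
_GRPC_KEY = re.compile(r"[0-9a-z_.-]+")


def invalid_grpc_keys(metadata):
    """Return the keys in a sequence of (key, value) pairs that break the rule."""
    return [key for key, _value in metadata if not _GRPC_KEY.fullmatch(key)]


# The key from the failing call contains uppercase letters.
print(invalid_grpc_keys([("PIP_PACKAGES", "pyyaml requests pandas openpyxl")]))

# A lowercase, hyphenated key passes the check.
print(invalid_grpc_keys([("x-goog-api-client", "gl-python/3.7.10")]))
```

If this reading is right, renaming the key would only silence the error. The value would still travel as request metadata and would not reach the cluster's instance metadata.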

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 11 (7 by maintainers)

Top GitHub Comments

1 reaction
pateash commented, Dec 13, 2021

@nicolas-settembrini No problem.

You just have to generate the config from all these arguments and then pass it to DataprocClusterCreateOperator. You can find more details in pull request #19446, where I have attached a snapshot of the documentation that will be coming in the next updates.
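
The workaround described above could look roughly like the sketch below. It assumes the Airflow 2 Google provider, where the operator is named DataprocCreateClusterOperator (DataprocClusterCreateOperator being the older name) and ClusterGenerator accepts the cluster metadata as a dict. The function names, the task_id, and the argument values are hypothetical, and the exact parameters may differ between provider versions.

```python
def metadata_pairs_to_dict(pairs):
    """Turn [('KEY', 'value'), ...] into a {'KEY': 'value'} mapping."""
    return dict(pairs)


def build_create_cluster_task(project_id, region, cluster_name,
                              init_actions_uris, metadata_pairs):
    """Build the cluster config first, then hand it to the operator."""
    # Imported lazily so the helper above can be used without Airflow installed.
    from airflow.providers.google.cloud.operators.dataproc import (
        ClusterGenerator,
        DataprocCreateClusterOperator,
    )

    # Cluster-level settings (init actions, instance metadata) go to the generator.
    cluster_config = ClusterGenerator(
        project_id=project_id,
        init_actions_uris=init_actions_uris,
        metadata=metadata_pairs_to_dict(metadata_pairs),
    ).make()

    # The operator receives the generated config; its own metadata argument is left unset.
    return DataprocCreateClusterOperator(
        task_id="create_dataproc",  # hypothetical task id
        project_id=project_id,
        region=region,
        cluster_name=cluster_name,
        cluster_config=cluster_config,
    )
```

The function would be called where the operator was previously instantiated in the DAG, passing the same init action URI and the ('PIP_PACKAGES', ...) pairs used before.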

1 reaction
nicolas-settembrini commented, Dec 13, 2021

Hi, sorry to write here, but I didn't find another place discussing this.

I am using version 2.1.4+composer, and I have a DAG where I defined the DataprocClusterCreateOperator like this:

create_dataproc = dataproc_operator.DataprocClusterCreateOperator(
    task_id='create_dataproc',
    cluster_name='dataproc-cluster-demo-{{ ds_nodash }}',
    num_workers=2,
    region='us-east4',
    zone='us-east4-a',
    subnetwork_uri='projects/example',
    internal_ip_only=True,
    tags=['allow-iap-ssh'],
    init_actions_uris=['gs://goog-dataproc-initialization-actions-us-east4/connectors/connectors.sh'],
    metadata=[('spark-bigquery-connector-url', 'gs://spark-lib/bigquery/spark-2.4-bigquery-0.23.1-preview.jar')],
    labels=dict(equipo='dm', ambiente='dev', etapa='datapreparation', producto='x', modelo='x'),
    master_machine_type='n1-standard-1',
    worker_machine_type='n1-standard-1',
    image_version='1.5-debian10',
)

I passed the metadata as a sequence of tuples, as I read here; using a dict does not work.

Also, the metadata is not being rendered in the cluster_config.

@pateash, could you please explain your workaround in more detail? In which part of the DAG should I use it?

Thanks in advance
