
Local Docker ETL with GCS inputs/outputs


Given a Docker container that can run the PUDL ETL with local inputs and outputs (#1606), have it run locally but read from and write to Google Cloud Storage (GCS).

  • Create an updated GCS cache of our current data archives to avoid Zenodo flakiness (see the cache-fallback sketch after this list).
  • Test whether the current --gcs-cache-path setup works with pudl_etl (nope!).
  • Copy the Zenodo cache over to the catalyst-cooperative-pudl project for now.
  • Add --gcs-cache-path and related arguments to the ferc1_to_sqlite script so it can use the Zenodo cache too.
  • Set the Docker container to read its input data from that GCS cache (this should just mean changing the pudl_etl.sh script).
  • Set up authentication tokens / secrets so that the ETL script can write outputs to GCS buckets.
  • Make GCS caching work with a publicly readable cache (it yields a 403 Forbidden right now; see the anonymous-client sketch below).
  • Use a dynamically generated storage location: a BASE_URL followed by the git_ref (tag or branch), as in the upload sketch below.
  • Copy Parquet & SQLite outputs to GCS once the ETL is done and the data has been validated.
  • Switch CI over to pointing at the GCS cache if it’s easy, to avoid Zenodo flakiness (moved to #1679).
  • Run the full ETL and all tests.
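
The point of the GCS cache is to stop every run from hammering Zenodo: check the cache bucket first, and only fall back to Zenodo (backfilling the cache) on a miss. Here is a minimal sketch of that pattern with the google-cloud-storage client; the bucket, prefix, and URL are illustrative placeholders, not PUDL’s actual configuration:

```python
from pathlib import Path

import requests
from google.cloud import storage


def fetch_with_gcs_cache(
    filename: str,
    zenodo_base_url: str,  # placeholder: base URL for a Zenodo record's files
    cache_bucket: str,     # placeholder: name of the GCS cache bucket
    cache_prefix: str,
    local_dir: Path,
) -> Path:
    """Fetch a data archive, preferring the GCS cache over Zenodo."""
    client = storage.Client()
    blob = client.bucket(cache_bucket).blob(f"{cache_prefix}/{filename}")
    local_path = local_dir / filename
    if blob.exists():
        # Cache hit: skip Zenodo entirely.
        blob.download_to_filename(local_path)
    else:
        # Cache miss: pull from Zenodo, then backfill the cache.
        resp = requests.get(f"{zenodo_base_url}/{filename}", timeout=60)
        resp.raise_for_status()
        local_path.write_bytes(resp.content)
        blob.upload_from_filename(local_path)
    return local_path
```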
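
On the 403 Forbidden against the publicly readable cache: one common cause is that the client attaches default credentials whose identity has no access to the bucket, even though an unauthenticated read would succeed. The Python client can sidestep this with an anonymous client that sends no credentials at all; a sketch, with hypothetical bucket and object names:

```python
from google.cloud import storage

# Anonymous client: no credentials are attached, so the request is served
# as an unauthenticated (public) read instead of being rejected with a 403.
client = storage.Client.create_anonymous_client()
bucket = client.bucket("some-public-zenodo-cache")  # hypothetical bucket
blob = bucket.blob("path/to/archive.zip")           # hypothetical object
blob.download_to_filename("archive.zip")
```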
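
And for the dynamically generated storage location: derive git_ref from the checkout (the branch name here; a tagged build would use `git describe --tags` instead) and copy the validated outputs under BASE_URL/git_ref. A sketch, where the bucket name is assumed from the comments below and the output directory is a placeholder:

```python
import subprocess
from pathlib import Path

from google.cloud import storage

BASE_BUCKET = "intake.catalyst.coop"  # assumption based on the comments below
OUTPUT_DIR = Path("pudl_out")         # placeholder local output directory

# Branch name of the current checkout; tags would need `git describe --tags`.
git_ref = subprocess.run(
    ["git", "rev-parse", "--abbrev-ref", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.strip()

client = storage.Client()
bucket = client.bucket(BASE_BUCKET)
for path in OUTPUT_DIR.rglob("*"):
    if path.suffix in {".parquet", ".sqlite"}:
        # Lands at gs://<BASE_BUCKET>/<git_ref>/<filename>
        bucket.blob(f"{git_ref}/{path.name}").upload_from_filename(path)
```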

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
zaneselvans commented, May 9, 2022

I feel like you’re the one on this issue now @bendnorman

0 reactions
bendnorman commented, Jun 22, 2022

Write to GCS Update

I don’t think outputs are being successfully rewritten when there are existing files in the buckets, because deploy-pudl-vm-service-account does not have permission to delete objects. This comes up when we rewrite the contents of the intake.catalyst.coop/dev directory and when nightly builds are rerun after a failure. I’ve made the deploy-pudl-vm-service-account service account a Storage Object Admin for intake.catalyst.coop and pudl-etl-logs, which allows it to delete files.
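
For reference, here is a minimal sketch of what that role grant might look like through the Python storage client. The service-account email is an assumption based on the standard <name>@<project>.iam.gserviceaccount.com format; in practice the grant may have been made in the console or with gcloud instead.

```python
from google.cloud import storage

# Assumed email format; the full address is not confirmed in the issue.
MEMBER = (
    "serviceAccount:deploy-pudl-vm-service-account"
    "@catalyst-cooperative-pudl.iam.gserviceaccount.com"
)

client = storage.Client()
for bucket_name in ("intake.catalyst.coop", "pudl-etl-logs"):
    bucket = client.bucket(bucket_name)
    policy = bucket.get_iam_policy(requested_policy_version=3)
    # Storage Object Admin includes storage.objects.delete, which the
    # plain Object Creator / Object Viewer roles lack.
    policy.bindings.append(
        {"role": "roles/storage.objectAdmin", "members": {MEMBER}}
    )
    bucket.set_iam_policy(policy)
```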


Top Results From Across the Web

  • Local Docker ETL with local inputs/outputs #1606 - GitHub
    Given a Docker container with our CI environment (#1605): Add local volumes to the container to point at PUDL_IN and PUDL_OUT. Maybe with…
  • Containerizing ETL Data Pipelines with Docker - Medium
    Creating a Docker container of our adoptable animal data project will give us a portable, isolated environment that we can run locally and…
  • Quickstart: Build and push a Docker image with Cloud Build
    Learn how to get started with Cloud Build. Build a Docker image and push it to Artifact Registry.
  • Developing AWS Glue ETL jobs locally using a container
    The machine running Docker hosts the AWS Glue container.
  • Pentaho Data Integration on Kubernetes
    Using the Google SDK, interacts with a GCS (Google Cloud Storage) bucket to download ETL artifacts (.ktr’s and .kjb’s).
