Local Docker ETL with GCS inputs/outputs
Given a Docker container that can run the PUDL ETL reading & writing data locally (#1606), have it run locally but read and write to cloud storage.
- Create an updated GCS cache with our current data archives to avoid Zenodo flakiness.
- Test whether the current `--gcs-cache-path` setup works with `pudl_etl` (nope!)
- Copy the Zenodo cache over to the `catalyst-cooperative-pudl` project for now.
- Add `--gcs-cache-path` and related arguments to the `ferc1_to_sqlite` script so it can use the Zenodo cache too.
- Set the Docker container to read its input data from that GCS cache (should just mean changing the `pudl_etl.sh` script; see the sketch after this list).
- Set up authentication tokens / secrets so that the ETL script can write outputs to GCS buckets.
- Make GCS caching work with a publicly readable cache (yields a 403 Forbidden right now).
- Use a dynamically generated storage location with a `BASE_URL` followed by the `git_ref` (tag or branch).
- Copy Parquet & SQLite outputs to GCS once the ETL is done and the data has been validated.
- Switch CI over to pointing at the GCS cache if it's easy, to avoid Zenodo flakiness. (moved to #1679)
- Run the full ETL and all tests.
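As a rough orientation for the checklist above, here is a minimal sketch of what the containerized `pudl_etl.sh` flow could look like once the GCS cache and output-publishing steps are wired up. The bucket names, the `GIT_REF` variable, the settings file names, and the `$PUDL_OUT` layout are illustrative assumptions; only the `--gcs-cache-path` flag and the `pudl_etl` / `ferc1_to_sqlite` scripts come from the checklist itself.

```bash
#!/usr/bin/env bash
# Minimal sketch of a pudl_etl.sh flow that reads inputs from a GCS cache and
# publishes outputs to GCS. Bucket names, GIT_REF, settings file names, and the
# $PUDL_OUT layout are illustrative assumptions, not the final script.
set -euo pipefail

GCS_CACHE="gs://zenodo-cache-bucket"   # hypothetical cache bucket in the catalyst-cooperative-pudl project
BASE_URL="gs://intake.catalyst.coop"   # output bucket (see the comments below)
GIT_REF="${GIT_REF:-dev}"              # tag or branch this build is running against

# Read raw inputs from the GCS cache rather than hitting Zenodo directly.
ferc1_to_sqlite --gcs-cache-path "$GCS_CACHE" ferc1_to_sqlite_settings.yml
pudl_etl --gcs-cache-path "$GCS_CACHE" etl_settings.yml

# ...run the data validation tests here; only publish if they pass...

# Copy Parquet & SQLite outputs to a ref-specific location under BASE_URL.
gsutil -m cp -r "$PUDL_OUT/sqlite" "$BASE_URL/$GIT_REF/"
gsutil -m cp -r "$PUDL_OUT/parquet" "$BASE_URL/$GIT_REF/"
```

Keying the destination on `$GIT_REF` means nightly `dev` builds and tagged releases land in separate prefixes, which is what the dynamically generated storage location item above is aiming for.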
Top GitHub Comments
I feel like you’re the one on this issue now @bendnorman
Write to GCS Update
I don’t think outputs are being successfully rewritten if there are existing files in the buckets, because `deploy-pudl-vm-service-account` does not have permission to delete objects. This is an issue when we rewrite the contents of the `intake.catalyst.coop/dev` directory and when nightly builds are rerun due to a failure. I’ve made the `deploy-pudl-vm-service-account` service account a Storage Object Admin for `intake.catalyst.coop` and `pudl-etl-logs`, which allows the service account to delete files.
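For reference, granting that role on both buckets with `gsutil` could look roughly like the following; the full service account email (in particular the project suffix) is an assumption based on the names mentioned above.

```bash
# Sketch: make deploy-pudl-vm-service-account a Storage Object Admin on both
# buckets so it can overwrite and delete existing objects. The project suffix
# in the email is assumed from the catalyst-cooperative-pudl project above.
SA="deploy-pudl-vm-service-account@catalyst-cooperative-pudl.iam.gserviceaccount.com"

# objectAdmin is gsutil shorthand for roles/storage.objectAdmin.
gsutil iam ch "serviceAccount:${SA}:objectAdmin" gs://intake.catalyst.coop
gsutil iam ch "serviceAccount:${SA}:objectAdmin" gs://pudl-etl-logs
```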