Local Docker ETL with GCS inputs/outputs
Given a Docker container that can run the PUDL ETL reading & writing data locally (#1606), have it run locally but read and write to cloud storage.
- Create an updated GCS cache with our current data archives to avoid Zenodo flakiness.
- Test whether the current `--gcs-cache-path` setup works with `pudl_etl` (nope!)
- Copy the Zenodo cache over to the `catalyst-cooperative-pudl` project for now.
- Add `--gcs-cache-path` and related arguments to the `ferc1_to_sqlite` script so it can use the Zenodo cache too.
- Set the Docker container to read its input data from that GCS cache (should just mean changing the `pudl_etl.sh` script; see the sketch after this list).
- Set up authentication tokens / secrets so that the ETL script can write outputs to GCS buckets.
- Make GCS caching work with a publicly readable cache (yields a 403 Forbidden right now).
- Use a dynamically generated storage location with a `BASE_URL` followed by the `git_ref` (tag or branch).
- Copy Parquet & SQLite outputs to GCS once the ETL is done and the data has been validated.
- Switch CI over to pointing at the GCS cache if it's easy, to avoid Zenodo flakiness. (moved to #1679)
- Run the full ETL and all tests.
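As a rough orientation for the checklist above, here is a minimal sketch of what the containerized `pudl_etl.sh` flow could look like once the GCS cache and output-publishing steps are wired up. The bucket names, the `GIT_REF` variable, the settings file names, and the `$PUDL_OUT` layout are illustrative assumptions; only the `--gcs-cache-path` flag and the `pudl_etl` / `ferc1_to_sqlite` scripts come from the checklist itself.

```bash
#!/usr/bin/env bash
# Minimal sketch of a pudl_etl.sh flow that reads inputs from a GCS cache and
# publishes outputs to GCS. Bucket names, GIT_REF, settings file names, and the
# $PUDL_OUT layout are illustrative assumptions, not the final script.
set -euo pipefail

GCS_CACHE="gs://zenodo-cache-bucket"   # hypothetical cache bucket in the catalyst-cooperative-pudl project
BASE_URL="gs://intake.catalyst.coop"   # output bucket (see the comments below)
GIT_REF="${GIT_REF:-dev}"              # tag or branch this build is running against

# Read raw inputs from the GCS cache rather than hitting Zenodo directly.
ferc1_to_sqlite --gcs-cache-path "$GCS_CACHE" ferc1_to_sqlite_settings.yml
pudl_etl --gcs-cache-path "$GCS_CACHE" etl_settings.yml

# ...run the data validation tests here; only publish if they pass...

# Copy Parquet & SQLite outputs to a ref-specific location under BASE_URL.
gsutil -m cp -r "$PUDL_OUT/sqlite" "$BASE_URL/$GIT_REF/"
gsutil -m cp -r "$PUDL_OUT/parquet" "$BASE_URL/$GIT_REF/"
```

Keying the destination on `$GIT_REF` means nightly `dev` builds and tagged releases land in separate prefixes, which is what the dynamically generated storage location item above is aiming for.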
Top GitHub Comments
I feel like you’re the one on this issue now @bendnorman
Write to GCS Update
I don’t think outputs are being successfully rewritten if there are existing files in the buckets, because `deploy-pudl-vm-service-account` does not have permission to delete objects. This is an issue when we rewrite the contents of the `intake.catalyst.coop/dev` directory and when nightly builds are rerun due to a failure. I’ve made the `deploy-pudl-vm-service-account` service account a Storage Object Admin for `intake.catalyst.coop` and `pudl-etl-logs`, which allows the service account to delete files.
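For reference, granting that role on both buckets with `gsutil` could look roughly like the following; the full service account email (in particular the project suffix) is an assumption based on the names mentioned above.

```bash
# Sketch: make deploy-pudl-vm-service-account a Storage Object Admin on both
# buckets so it can overwrite and delete existing objects. The project suffix
# in the email is assumed from the catalyst-cooperative-pudl project above.
SA="deploy-pudl-vm-service-account@catalyst-cooperative-pudl.iam.gserviceaccount.com"

# objectAdmin is gsutil shorthand for roles/storage.objectAdmin.
gsutil iam ch "serviceAccount:${SA}:objectAdmin" gs://intake.catalyst.coop
gsutil iam ch "serviceAccount:${SA}:objectAdmin" gs://pudl-etl-logs
```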