Write a data release script
We’re publishing / archiving our data releases at Zenodo, and they provide a RESTful API that we can access from within Python to automate the release process and to populate the Zenodo metadata fields, many of which we already compile for our own metadata.
In addition, we need to make sure that we generate the archived data in a controlled and reproducible way, using a particular published release of the `catalystcoop.pudl` package, in a well-defined Python environment.
This will all be much more reliable and repeatable (and easier) if we have a script that does it for us the same way every time. What all would such a script need to do?
- Create a reproducible software environment: Install the most recently released version of `catalystcoop.pudl` in a fresh conda environment along with all of its dependencies, and record the entire list of installed Python packages and their versions in an `environment.yml` or `requirements.txt` file.
- Acquire fresh data: Download a fresh copy of all of the input data to be packaged, so we know what date it is from. Otherwise we all have potentially different collections of older downloads in our datastores, since sometimes agencies go back and alter the data after its initial publication.
- Run the ETL process: Using the installed PUDL software and the fresh data, generate the bundle of tabular data packages to be released. This may include reserving one or more DOIs using the Zenodo API (See Issue #419).
- Validate the packaged data: Populate a local SQLite DB with all of the data to be released, and run the full set of data validation tests on it. Any failures should be documented in the data release notes.
- Create a compressed archive of the inputs: This is a `.zip` or `.tgz` of the datastore that was used in the ETL process, for archiving alongside the data packages, so that the process can be reproduced. Should this be one giant file, or should we break it out by data source? (See the archiving sketch after this list.)
- Create a compressed archive of the outputs: This is a `.zip` or `.tgz` of each of the individual data packages being archived. Keeping the data packages separate means people can download only the ones they need or want.
- Upload everything to Zenodo: If a DOI has been reserved earlier in the process, we’ll need to make sure that we’re uploading to that same archive. (See the Zenodo sketch after this list.) Stuff to archive includes:
  - the compressed archive(s) of the input datastore.
  - the `environment.yml` file defining the Python environment that was used.
  - the compressed archives of the output data packages.
  - a script which, given the contents of the Zenodo archive, can perfectly re-create the archived data packages. This will probably be the same script that this issue is referring to.
- Populate Zenodo metadata: Using existing metadata (and possibly some additional information that is given to the data release script as input), populate the metadata associated with the data release at Zenodo using their RESTful API.
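
To make the archiving steps concrete, here is a minimal sketch using only the Python standard library. It assumes the inputs live in one directory per data source and the outputs in one directory per data package; the directory names are illustrative, not PUDL's actual layout, and it produces one archive per source and per package rather than one giant file.

```python
"""Sketch: compress the ETL inputs and outputs for archiving at Zenodo."""
from pathlib import Path
import shutil

DATASTORE_DIR = Path("datastore")  # raw inputs used by the ETL (assumed layout)
DATAPKG_DIR = Path("datapkg")      # tabular data packages produced by the ETL (assumed layout)
RELEASE_DIR = Path("release")      # everything that will be uploaded to Zenodo
RELEASE_DIR.mkdir(exist_ok=True)

# One gzipped tarball per data source, so users can grab only the inputs they need.
for source in sorted(p for p in DATASTORE_DIR.iterdir() if p.is_dir()):
    shutil.make_archive(
        base_name=str(RELEASE_DIR / f"pudl-input-{source.name}"),
        format="gztar",
        root_dir=DATASTORE_DIR,
        base_dir=source.name,
    )

# One .zip per output data package, for the same reason.
for pkg in sorted(p for p in DATAPKG_DIR.iterdir() if p.is_dir()):
    shutil.make_archive(
        base_name=str(RELEASE_DIR / pkg.name),
        format="zip",
        root_dir=DATAPKG_DIR,
        base_dir=pkg.name,
    )
```

Per-source and per-package archives keep individual downloads small, at the cost of more files in the Zenodo record.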
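And a minimal sketch of the Zenodo half: reserving a DOI, uploading the release artifacts, and populating the deposition metadata via the REST API. It assumes a personal access token in a `ZENODO_TOKEN` environment variable and that the compressed archives, `environment.yml`, and the release script have already been gathered into the local `release/` directory; it targets the Zenodo sandbox, and every metadata value is a placeholder to be filled from PUDL's own metadata.

```python
"""Sketch: reserve a DOI, upload release artifacts, and set Zenodo metadata."""
import os
from pathlib import Path

import requests

# Target the sandbox so nothing gets published by accident; swap in
# https://zenodo.org/api and a production token for a real release.
ZENODO_API = "https://sandbox.zenodo.org/api"
TOKEN = os.environ["ZENODO_TOKEN"]  # personal access token with deposit scopes
PARAMS = {"access_token": TOKEN}

# 1. Create an empty deposition. Zenodo pre-reserves a DOI for it, which we
#    can embed in the data package metadata before anything is published.
resp = requests.post(f"{ZENODO_API}/deposit/depositions", params=PARAMS, json={})
resp.raise_for_status()
deposition = resp.json()
doi = deposition["metadata"]["prereserve_doi"]["doi"]
bucket_url = deposition["links"]["bucket"]
print(f"Reserved DOI: {doi}")

# 2. Upload the compressed inputs and outputs, environment.yml, and the
#    release script itself from the local release/ directory.
for path in sorted(Path("release").iterdir()):
    with path.open("rb") as fp:
        upload = requests.put(f"{bucket_url}/{path.name}", data=fp, params=PARAMS)
        upload.raise_for_status()

# 3. Populate the deposition metadata (placeholder values that a real
#    script would fill from the metadata we already compile).
metadata = {
    "metadata": {
        "title": "PUDL Data Release (example)",
        "upload_type": "dataset",
        "description": "Tabular data packages plus the inputs and software "
                       "environment needed to reproduce them.",
        "creators": [{"name": "Catalyst Cooperative"}],
        "version": "X.Y.Z",  # the catalystcoop.pudl release used for the ETL
    }
}
resp = requests.put(
    f"{ZENODO_API}/deposit/depositions/{deposition['id']}",
    params=PARAMS,
    json=metadata,
)
resp.raise_for_status()

# Publishing is left as a separate, deliberate step: POST to
# .../deposit/depositions/{id}/actions/publish once validation has passed.
```

Keeping publication as a separate final step means a failed validation run never leaves behind a half-published archive.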
Top GitHub Comments
Yeah, I think without changes to the metadata specification, that’s probably the best thing to do. We already have a `datapkg-bundle-uuid` field at the package level, so I guess we can just add a parallel `datapkg-bundle-doi` field alongside it for the bundles that get archived, and set the `id` field to a generic UUID so it can definitely be uniquely identified if it’s found in the wild somewhere.

Hi @zaneselvans!
For the DOI question, have you come to a conclusion? I’m happy to jump on a call to discuss if that would help (we can chat at our normal call on Monday, but I’m happy to chat earlier too). My thoughts in a nutshell: the large Zenodo archive (containing datapackages all generated by the same ETL, which should all be compatible*) gets one DOI. The datapackages inside that archive get two IDs: one that is the UUID for the datapackage, and one called something like “master_archive_id” that is the same as the archive’s DOI. That way the single datapackages can be linked back to the master archive. Does that make sense? Am I missing something? I had to draw out a diagram to think this through 🙃
*assuming that all datapackages from the same ETL should be compatible
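
A sketch of how the identifiers being discussed might sit together in each data package descriptor. The key names follow the comments above (`datapkg-bundle-uuid` already exists, `datapkg-bundle-doi` is the proposed addition); all values are made up for illustration.

```python
# Illustrative only: the bundle-level identifiers proposed above, alongside
# the package-level "id", as they might appear in each datapackage.json of
# an archived bundle. All values here are placeholders.
descriptor = {
    "name": "example-data-package",
    "id": "0f3a9e2c-0000-0000-0000-000000000000",  # generic UUID, unique to this package
    "datapkg-bundle-uuid": "7d1b4c9a-0000-0000-0000-000000000000",  # shared by all packages from one ETL run
    "datapkg-bundle-doi": "10.5281/zenodo.0000000",  # DOI of the Zenodo archive holding the whole bundle
    # ... the rest of the tabular data package descriptor ...
}
```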
For the validation question, I’m going to tag @roll, but I think your idea here is correct: