Write a data release script
We’re publishing / archiving our data releases at Zenodo, and they provide a RESTful API that we can access from within Python to automate the release process and to populate the Zenodo metadata fields, many of which we already compile for our own metadata.
In addition, we need to make sure that we generate the archived data in a controlled and reproducible way, using a particular published release of the `catalystcoop.pudl` package, in a well-defined Python environment.
This will all be much more reliable and repeatable (and easier) if we have a script that does it for us the same way every time. What all would such a script need to do?
- Create a reproducible software environment: Install the most recently released version of `catalystcoop.pudl` in a fresh conda environment along with all of its dependencies, and record the entire list of installed Python packages and their versions in an `environment.yml` or `requirements.txt` file.
- Acquire fresh data: Download a fresh copy of all of the input data to be packaged, so we know what date it is from. Otherwise we all have potentially different collections of older downloads in our datastores, since sometimes agencies go back and alter the data after its initial publication.
- Run the ETL process: Using the installed PUDL software and the fresh data, generate the bundle of tabular data packages to be released. This may include reserving one or more DOIs using the Zenodo API (See Issue #419).
- Validate the packaged data: Populate a local SQLite DB with all of the data to be released, and run the full set of data validation tests on it. Any failures should be documented in the data release notes.
- Create a compressed archive of the inputs: This is a `.zip` or `.tgz` of the datastore that was used in the ETL process, for archiving alongside the data packages, so that the process can be reproduced. Should this be one giant file, or should we break it out by data source? (See the archiving sketch after this list.)
- Create a compressed archive of the outputs: This is a `.zip` or `.tgz` of each of the individual data packages being archived. Keeping the data packages separate means people can download only the ones they need or want.
- Upload everything to Zenodo: If a DOI has been reserved earlier in the process, we’ll need to make sure that we’re uploading to that same archive. (See the Zenodo sketch after this list.) Stuff to archive includes:
  - the compressed archive(s) of the input datastore.
  - the `environment.yml` file defining the Python environment that was used.
  - the compressed archives of the output data packages.
  - a script which, given the contents of the Zenodo archive, can perfectly re-create the archived data packages. This will probably be the same script that this issue is referring to.
- Populate Zenodo metadata: Using existing metadata (and possibly some additional information that is given to the data release script as input), populate the metadata associated with the data release at Zenodo using their RESTful API.
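
To make the archiving steps concrete, here is a minimal sketch using only the Python standard library. It assumes the inputs live in one directory per data source and the outputs in one directory per data package; the directory names are illustrative, not PUDL's actual layout, and it produces one archive per source and per package rather than one giant file.

```python
"""Sketch: compress the ETL inputs and outputs for archiving at Zenodo."""
from pathlib import Path
import shutil

DATASTORE_DIR = Path("datastore")  # raw inputs used by the ETL (assumed layout)
DATAPKG_DIR = Path("datapkg")      # tabular data packages produced by the ETL (assumed layout)
RELEASE_DIR = Path("release")      # everything that will be uploaded to Zenodo
RELEASE_DIR.mkdir(exist_ok=True)

# One gzipped tarball per data source, so users can grab only the inputs they need.
for source in sorted(p for p in DATASTORE_DIR.iterdir() if p.is_dir()):
    shutil.make_archive(
        base_name=str(RELEASE_DIR / f"pudl-input-{source.name}"),
        format="gztar",
        root_dir=DATASTORE_DIR,
        base_dir=source.name,
    )

# One .zip per output data package, for the same reason.
for pkg in sorted(p for p in DATAPKG_DIR.iterdir() if p.is_dir()):
    shutil.make_archive(
        base_name=str(RELEASE_DIR / pkg.name),
        format="zip",
        root_dir=DATAPKG_DIR,
        base_dir=pkg.name,
    )
```

Per-source and per-package archives keep individual downloads small, at the cost of more files in the Zenodo record.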
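And a minimal sketch of the Zenodo half: reserving a DOI, uploading the release artifacts, and populating the deposition metadata via the REST API. It assumes a personal access token in a `ZENODO_TOKEN` environment variable and that the compressed archives, `environment.yml`, and the release script have already been gathered into the local `release/` directory; it targets the Zenodo sandbox, and every metadata value is a placeholder to be filled from PUDL's own metadata.

```python
"""Sketch: reserve a DOI, upload release artifacts, and set Zenodo metadata."""
import os
from pathlib import Path

import requests

# Target the sandbox so nothing gets published by accident; swap in
# https://zenodo.org/api and a production token for a real release.
ZENODO_API = "https://sandbox.zenodo.org/api"
TOKEN = os.environ["ZENODO_TOKEN"]  # personal access token with deposit scopes
PARAMS = {"access_token": TOKEN}

# 1. Create an empty deposition. Zenodo pre-reserves a DOI for it, which we
#    can embed in the data package metadata before anything is published.
resp = requests.post(f"{ZENODO_API}/deposit/depositions", params=PARAMS, json={})
resp.raise_for_status()
deposition = resp.json()
doi = deposition["metadata"]["prereserve_doi"]["doi"]
bucket_url = deposition["links"]["bucket"]
print(f"Reserved DOI: {doi}")

# 2. Upload the compressed inputs and outputs, environment.yml, and the
#    release script itself from the local release/ directory.
for path in sorted(Path("release").iterdir()):
    with path.open("rb") as fp:
        upload = requests.put(f"{bucket_url}/{path.name}", data=fp, params=PARAMS)
        upload.raise_for_status()

# 3. Populate the deposition metadata (placeholder values that a real
#    script would fill from the metadata we already compile).
metadata = {
    "metadata": {
        "title": "PUDL Data Release (example)",
        "upload_type": "dataset",
        "description": "Tabular data packages plus the inputs and software "
                       "environment needed to reproduce them.",
        "creators": [{"name": "Catalyst Cooperative"}],
        "version": "X.Y.Z",  # the catalystcoop.pudl release used for the ETL
    }
}
resp = requests.put(
    f"{ZENODO_API}/deposit/depositions/{deposition['id']}",
    params=PARAMS,
    json=metadata,
)
resp.raise_for_status()

# Publishing is left as a separate, deliberate step: POST to
# .../deposit/depositions/{id}/actions/publish once validation has passed.
```

Keeping publication as a separate final step means a failed validation run never leaves behind a half-published archive.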
Top GitHub Comments
Yeah, I think without changes to the metadata specification, that’s probably the best thing to do. We already have a `datapkg-bundle-uuid` field at the package level, so I guess we can just add a parallel `datapkg-bundle-doi` field alongside it for the bundles that get archived, and set the `id` field to a generic UUID so it can definitely be uniquely identified if it’s found in the wild somewhere.

Hi @zaneselvans!
For the DOI question, have you come to a conclusion? I’m happy to jump on a call to discuss if that would help (we can chat at our normal call on Monday, but I’m happy to chat earlier too). My thoughts in a nutshell: the large Zenodo archive (containing datapackages all generated by the same ETL, which should all be compatible*) gets one DOI. The datapackages inside that archive get two IDs: one that is the UUID for the datapackage, and one called something like “master_archive_id” that is the same as the archive’s DOI. That way the single datapackages can be linked back to the master archive. Does that make sense? Am I missing something? I had to draw out a diagram to think this through 🙃
*assuming that all datapackages from the same ETL should be compatible
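
A sketch of how the identifiers being discussed might sit together in each data package descriptor. The key names follow the comments above (`datapkg-bundle-uuid` already exists, `datapkg-bundle-doi` is the proposed addition); all values are made up for illustration.

```python
# Illustrative only: the bundle-level identifiers proposed above, alongside
# the package-level "id", as they might appear in each datapackage.json of
# an archived bundle. All values here are placeholders.
descriptor = {
    "name": "example-data-package",
    "id": "0f3a9e2c-0000-0000-0000-000000000000",  # generic UUID, unique to this package
    "datapkg-bundle-uuid": "7d1b4c9a-0000-0000-0000-000000000000",  # shared by all packages from one ETL run
    "datapkg-bundle-doi": "10.5281/zenodo.0000000",  # DOI of the Zenodo archive holding the whole bundle
    # ... the rest of the tabular data package descriptor ...
}
```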
For the validation question, I’m going to tag @roll, but I think your idea here is correct: