Easy way to discover valid `--partitions` when using `pudl_datastore`
Is your feature request related to a problem? Please describe.
`$ pudl_datastore --help` lists a `--partitions` option but not how to use it, as it does not list the valid KEY=VALUE arguments (e.g. `--partitions year=2018` and so on). This is using the current `main` branch (at https://github.com/catalyst-cooperative/pudl/tree/a2c1b996ea81015e586e392bb95609da76161cec).
Describe the solution you’d like
I think it’d be helpful to be able to list the valid partitions for a given dataset at the command line when using `pudl_datastore`.
- One option would be to include this in the output of `pudl_datastore --help` (although, as valid partitions vary by dataset, this could be quite verbose).
  - e.g. this info could be printed alongside the datasets, like
    `Available Production Datasets: - eia860 [partitions: years=(2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)]`
    although I think it would be best if the printed info (`year=...`) was valid syntax for the command, so it can be copy-pasted into `pudl_datastore --partitions <paste>`.
- Another alternative would be to provide a separate command or argument to access this info, and have `pudl_datastore` direct users to that.
  - e.g. `pudl_datastore` could print something along the lines of
    ``--partition KEY=VALUE,... Only retrieve resources matching these conditions. To see valid partitions run `pudl_partitions --dataset DATASET`.``
    Again, copy-pastable output (from a `pudl_partitions` kind of command) would be nice to have 😃
Describe alternatives you’ve considered
A current workaround is to open Python, `import pudl.constants as pc`, and inspect `pc.working_partitions`.
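As a rough illustration of the copy-pastable output requested above, here is a minimal sketch that turns a mapping shaped like `pc.working_partitions` into `KEY=VALUE` strings. The sample dictionary is a hypothetical stand-in (the real contents and shape may differ), `partition_args` is an illustrative helper rather than part of pudl, and the plural-to-singular key mapping (`years` to `year`) is an assumption.

```python
# Hypothetical stand-in for pudl.constants.working_partitions;
# the real contents and shape may differ.
working_partitions = {
    "eia860": {"years": (2017, 2018, 2019)},
    "epacems": {"years": (2018, 2019), "states": ("CO", "ID")},
}


def partition_args(partitions):
    """Return one KEY=VALUE string per valid value of each partition key."""
    args = []
    for key, values in partitions.items():
        # Assumption: plural keys such as "years" correspond to the
        # singular command-line key "year".
        cli_key = key[:-1] if key.endswith("s") else key
        args.extend(f"{cli_key}={value}" for value in values)
    return args


for dataset, partitions in working_partitions.items():
    print(f"{dataset}: {' '.join(partition_args(partitions))}")
```

Each printed token is intended to be pasted after the partition option; whether the command accepts exactly this spelling of keys and values is not confirmed here.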
Additional context
The output of `pudl_datastore --help` (using `main`) says only that `--partitions` should be key-value pairs, e.g.
$ pudl_datastore --help
usage: pudl_datastore [-h] [--dataset DATASET] [--pudl_in PUDL_IN] [--validate] [--sandbox] [--loglevel LOGLEVEL] [--quiet] [--populate-gcs-cache POPULATE_GCS_CACHE]
[--partition KEY=VALUE,...]
Download and cache ETL source data from Zenodo.
optional arguments:
-h, --help show this help message and exit
...
...
--partition KEY=VALUE,...
Only retrieve resources matching these conditions.
Available Production Datasets:
- censusdp1tract
...
Compare this to the output of `pudl_data --help` using pudl v3.2:
$ pudl_data --help
usage: pudl_data [-h] [-q] [-z] [-c] [-d DATASTORE_DIR]
[-s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]]
[-y YEARS [YEARS ...]] [--no_download]
[-t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]]
A CLI for fetching public utility data from reporting agency servers.
...
...
optional arguments:
-h, --help show this help message and exit
...
...
-s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...], --sources {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]
List of data sources which should be downloaded.
(default: ('eia860', 'eia861', 'eia923', 'epacems',
'epaipm', 'ferc1')).
-y YEARS [YEARS ...], --years YEARS [YEARS ...]
List of years for which data should be downloaded.
Different data sources have differet valid years. If
data is not available for a specified year and data
source, it will be ignored. If no years are specified,
all available data will be downloaded for all
requested data sources.
...
...
-t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...], --states {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]
List of two letter US state abbreviations indicating
which states data should be downloaded. Currently only
applicable to the EPA's CEMS dataset.
We see that `pudl_data` would list the valid `-y` (years) and `-t` (states) values.
p.s. `pudl` is great – thanks for all your good work!
Top GitHub Comments
Thanks for the feedback. When I added this flag, I did not think much about usability, so this conversation is really useful for gaining insights.
The main purpose of this was to make local development/ETL execution easier by fetching a smaller subset of data to work with (e.g. only a specific year) without having to run the ETL first (which would do this download on its first pass).
As it is implemented, you can only pass a single value for each valid partition key, so you can’t really request more than one state or year. If you specify an unknown partition, the filtering logic will simply exclude every resource file (nothing matches).
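The behavior described above can be sketched as follows. This is not the actual pudl implementation: `parse_partition`, `matches`, and the sample resources are hypothetical and only illustrate why a single value per key and an unknown key behave as stated.

```python
def parse_partition(arg):
    """Parse 'KEY=VALUE,...' into a dict holding one value per key."""
    filters = {}
    for pair in arg.split(","):
        key, _, value = pair.partition("=")
        filters[key.strip()] = value.strip()
    return filters


def matches(resource, filters):
    """True only if every requested key exists in parts with an equal value."""
    parts = resource.get("parts", {})
    return all(
        key in parts and str(parts[key]) == value
        for key, value in filters.items()
    )


# Hypothetical resource entries, shaped like resources[] in a datapackage.json.
resources = [
    {"name": "eia860-2018.zip", "parts": {"year": 2018}},
    {"name": "eia860-2019.zip", "parts": {"year": 2019}},
]

selected = [r["name"] for r in resources if matches(r, parse_partition("year=2018"))]
unknown = [r["name"] for r in resources if matches(r, parse_partition("years=2018"))]
print(selected)
print(unknown)
```

Because every requested key must be present on a resource, a misspelled key such as `years` silently selects nothing instead of raising an error, which is the usability problem the issue raises.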
While we could perhaps infer some of the valid partition keys from `constants`, the right way to do this would be to fetch `datapackage.json` files for all known datasets from Zenodo (either production or sandbox) and grab all keys from `resources[].parts`. Because this would depend on contacting Zenodo, it’s probably not a good idea to add it to the default help screen.
What about listing this by running `pudl_datastore --list-partitions`?
Resolved by https://github.com/catalyst-cooperative/pudl/pull/925 Thanks!
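For reference, the idea of gathering valid partitions from `resources[].parts` could look roughly like the sketch below. It works on an already-parsed `datapackage.json` dictionary and does not contact Zenodo; `collect_partitions` and the sample data are hypothetical, and this is not necessarily how the linked pull request implements it.

```python
from collections import defaultdict


def collect_partitions(datapackage):
    """Map each partition key to the sorted values seen across all resources."""
    valid = defaultdict(set)
    for resource in datapackage.get("resources", []):
        for key, value in resource.get("parts", {}).items():
            valid[key].add(value)
    return {key: sorted(values) for key, values in valid.items()}


# Hypothetical parsed datapackage.json content for one dataset.
datapackage = {
    "resources": [
        {"name": "a.zip", "parts": {"year": 2018, "state": "co"}},
        {"name": "b.zip", "parts": {"year": 2019, "state": "id"}},
        {"name": "c.zip", "parts": {"year": 2018, "state": "id"}},
    ]
}

for key, values in collect_partitions(datapackage).items():
    print(f"{key}: {values}")
```

Collecting values into sets removes duplicates across resources, so each key is listed once with every value that at least one resource carries.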