Easy way to discover valid `--partitions` when using `pudl_datastore`

Is your feature request related to a problem? Please describe.

`pudl_datastore --help` lists a `--partitions` option but does not explain how to use it, because it does not list the valid KEY=VALUE arguments (e.g. `--partitions year=2018` and so on). This is using the current main branch (at https://github.com/catalyst-cooperative/pudl/tree/a2c1b996ea81015e586e392bb95609da76161cec).

Describe the solution you’d like

I think it’d be helpful to be able to list the valid partitions for a given dataset at the command line when using pudl_datastore.

One option would be to include this in the output of pudl_datastore --help (although as valid partitions vary by dataset this could be quite verbose).
- e.g. this info could be printed alongside the datasets, like
```
Available Production Datasets:
    - eia860 [partitions: years=(2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)]
```
  although I think it would be best if the printed info (year=...) were valid syntax for the command, so it can be copy-pasted into pudl_datastore --partitions <paste>.
Another alternative would be to provide a separate command or argument to access this info, and have pudl_datastore direct users to that.
- e.g. pudl_datastore could print something along the lines of
```
--partition KEY=VALUE,...
                      Only retrieve resources matching these conditions.
                      To see valid partitions run `pudl_partitions --dataset DATASET`.
```
  Again, copy-pastable output (from a pudl_partitions kind of command) would be nice to have 😃

Describe alternatives you’ve considered

A current “work around” is to open python, import pudl.constants as pc and inspect pc.working_partitions.
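For anyone using this workaround, here is a minimal sketch of turning such a mapping into copy-pastable KEY=VALUE strings. The dict below is a hypothetical stand-in for one entry of `pc.working_partitions` (the real structure and key names may differ), and `format_partition_args` is an illustrative helper, not part of pudl.

```python
def format_partition_args(partitions):
    """Return one 'key=value' string per valid value of each partition key."""
    args = []
    for key, values in partitions.items():
        for value in values:
            args.append(f"{key}={value}")
    return args


# Hypothetical stand-in for the partitions of a single dataset.
example = {"years": (2017, 2018, 2019)}

for arg in format_partition_args(example):
    print(arg)
```

The key names come straight from the mapping, so they may not match what the command-line option expects; check a printed string against the CLI before relying on it.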

Additional context

The output of pudl_datastore --help (using main) says only that --partitions should be key-value pairs. e.g.

```
$ pudl_datastore --help
usage: pudl_datastore [-h] [--dataset DATASET] [--pudl_in PUDL_IN] [--validate] [--sandbox] [--loglevel LOGLEVEL] [--quiet] [--populate-gcs-cache POPULATE_GCS_CACHE]
                      [--partition KEY=VALUE,...]

Download and cache ETL source data from Zenodo.

optional arguments:
  -h, --help            show this help message and exit
...
...
  --partition KEY=VALUE,...
                        Only retrieve resources matching these conditions.

Available Production Datasets:
    - censusdp1tract
...
```

Compare this to the output of pudl_data --help using pudl v3.2:

```
$ pudl_data --help
usage: pudl_data [-h] [-q] [-z] [-c] [-d DATASTORE_DIR]
                 [-s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]]
                 [-y YEARS [YEARS ...]] [--no_download]
                 [-t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]]

A CLI for fetching public utility data from reporting agency servers.
...
...

optional arguments:
  -h, --help            show this help message and exit
...
...
  -s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...], --sources {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]
                        List of data sources which should be downloaded.
                        (default: ('eia860', 'eia861', 'eia923', 'epacems',
                        'epaipm', 'ferc1')).
  -y YEARS [YEARS ...], --years YEARS [YEARS ...]
                        List of years for which data should be downloaded.
                        Different data sources have differet valid years. If
                        data is not available for a specified year and data
                        source, it will be ignored. If no years are specified,
                        all available data will be downloaded for all
                        requested data sources.
...
...
  -t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...], --states {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]
                        List of two letter US state abbreviations indicating
                        which states data should be downloaded. Currently only
                        applicable to the EPA's CEMS dataset.
```

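We see that pudl_data listed the valid values for -s (sources) and -t (states) directly in its help output, and explained how -y (years) is handled when a year is not available.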

p.s. pudl is great – thanks for all your good work!


Top GitHub Comments

rousik commented, Feb 10, 2021

Thanks for the feedback. When I added this flag I didn't think much about usability, so this conversation is a really useful source of insight.

The main purpose of this was to make local development/ETL execution easier by fetching a smaller subset of data to work with (e.g. only a specific year) w/o having to run the ETL first (which would do this download on its first pass).

As it is implemented, you can only pass a single value for each valid partition key, so you can’t really request more than one state or year. If you specify an unknown partition, the filtering logic will simply exclude every resource file (nothing matches).
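The behaviour described above can be illustrated with a small sketch. This is not the pudl implementation: `matches`, `filter_resources`, and the resource names are hypothetical, and the only thing assumed from the discussion is that each resource carries a `parts` mapping.

```python
def matches(parts, filters):
    """True only if every requested key exists in parts with an equal value."""
    return all(
        key in parts and str(parts[key]) == str(value)
        for key, value in filters.items()
    )


def filter_resources(resources, filters):
    """Keep the resources whose 'parts' satisfy every KEY=VALUE filter."""
    return [r for r in resources if matches(r.get("parts", {}), filters)]


# Hypothetical resource descriptors, for illustration only.
resources = [
    {"name": "eia860-2018.zip", "parts": {"year": 2018}},
    {"name": "eia860-2019.zip", "parts": {"year": 2019}},
]

# A known key with a single value selects the matching resource.
print([r["name"] for r in filter_resources(resources, {"year": "2018"})])

# An unknown key matches nothing, so every resource is excluded.
print(filter_resources(resources, {"state": "CA"}))
```

Because each key maps to exactly one value, there is no way in this scheme to ask for two years at once.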

While we could perhaps infer some of the valid partition keys from constants, the right way to do this would be to fetch the datapackage.json files for all known datasets from Zenodo (either production or sandbox) and grab all keys from resources[].parts. Because this would depend on contacting Zenodo, it’s probably not a good idea to add it to the default help screen.
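As a rough sketch of the "grab all keys from resources[].parts" step, assuming the datapackage.json has already been downloaded and parsed into a dict. The sample descriptor and the `collect_partitions` helper are hypothetical, and fetching from Zenodo is deliberately left out.

```python
from collections import defaultdict


def collect_partitions(datapackage):
    """Map each partition key to the values seen across resources[].parts."""
    found = defaultdict(set)
    for resource in datapackage.get("resources", []):
        for key, value in resource.get("parts", {}).items():
            found[key].add(value)
    # Sort by string form so mixed value types do not raise a TypeError.
    return {key: sorted(values, key=str) for key, values in found.items()}


# Hypothetical, heavily trimmed datapackage descriptor.
datapackage = {
    "resources": [
        {"name": "a.zip", "parts": {"year": 2018, "state": "CA"}},
        {"name": "b.zip", "parts": {"year": 2019, "state": "CA"}},
        {"name": "c.zip"},  # a resource without partitions is skipped
    ]
}

for key, values in collect_partitions(datapackage).items():
    print(f"{key}: {values}")
```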

What about listing this by running pudl_datastore --list-partitions?
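One way such a flag could be wired up with argparse is sketched below. This is only an illustration of the proposal, not code from pudl: `parse_key_values` and `build_parser` are hypothetical helpers, and the help strings are made up.

```python
import argparse


def parse_key_values(text):
    """Parse 'k1=v1,k2=v2' into a dict, rejecting malformed items."""
    result = {}
    for item in text.split(","):
        key, sep, value = item.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {item!r}")
        result[key.strip()] = value.strip()
    return result


def build_parser():
    # The prog name mirrors the CLI under discussion; everything else is a sketch.
    parser = argparse.ArgumentParser(prog="pudl_datastore")
    parser.add_argument("--dataset", help="Dataset to operate on.")
    parser.add_argument(
        "--partition",
        type=parse_key_values,
        default={},
        metavar="KEY=VALUE,...",
        help="Only retrieve resources matching these conditions.",
    )
    parser.add_argument(
        "--list-partitions",
        action="store_true",
        help="Print the valid partitions and exit (hypothetical behaviour).",
    )
    return parser


args = build_parser().parse_args(["--dataset", "eia860", "--list-partitions"])
print(args.dataset, args.list_partitions, args.partition)
```

Keeping the listing behind its own flag means the default help screen never needs a network request.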

nickrobinson251 commented, Feb 23, 2021

Resolved by https://github.com/catalyst-cooperative/pudl/pull/925 Thanks!