Easy way to discover valid `--partitions` when using `pudl_datastore`

Is your feature request related to a problem? Please describe.

$ pudl_datastore --help lists a --partitions option but not how to use it, as it does not list the valid KEY=VALUE arguments (e.g. --partitions year=2018 and so on). This is using the current main branch (at https://github.com/catalyst-cooperative/pudl/tree/a2c1b996ea81015e586e392bb95609da76161cec).

Describe the solution you’d like

I think it’d be helpful to be able to list the valid partitions for a given dataset at the command line when using pudl_datastore.

  • One option would be to include this in the output of pudl_datastore --help (although as valid partitions vary by dataset this could be quite verbose).
    • e.g. this info could be printed alongside the datasets, like
      Available Production Datasets:
          - eia860 [partitions: years=(2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019)]
      
      although I think it would be best if the printed info (year=...) were valid syntax for the command, so it can be copy-pasted into pudl_datastore --partitions <paste>
  • Another alternative would be to provide a separate command or argument to access this info, and have pudl_datastore direct users to that
    • e.g. pudl_datastore could print something along the lines of
      --partition KEY=VALUE,...
                            Only retrieve resources matching these conditions.
                            To see valid partitions run `pudl_partitions --dataset DATASET`.
      
      Again, copy-pastable output (from a pudl_partitions kind of command) would be nice to have 😃

Describe alternatives you’ve considered

A current “workaround” is to open Python, import pudl.constants as pc, and inspect pc.working_partitions.
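The workaround above could be wrapped in a small helper that prints copy-pastable arguments. This is only a sketch: it assumes the partitions for one dataset are available as a `{key: values}` mapping, which may not match the exact shape of `pc.working_partitions`. The helper name `format_partition_args` and the sample data are made up for illustration.

```python
from typing import Dict, Iterable, List


def format_partition_args(partitions: Dict[str, Iterable]) -> List[str]:
    """Turn {key: values} into KEY=VALUE strings, one string per value."""
    args = []
    for key, values in partitions.items():
        for value in values:
            args.append(f"{key}={value}")
    return args


# Illustrative stand-in only; in practice this would be derived from
# pudl.constants.working_partitions (its exact shape is assumed here).
sample = {"eia860": {"year": [2018, 2019]}}

for dataset, parts in sample.items():
    for arg in format_partition_args(parts):
        # One command per value, ready to paste into a shell.
        print(f"pudl_datastore --dataset {dataset} --partition {arg}")
```

Each printed line uses only the `--dataset` and `--partition` options that appear in the help output quoted in this issue.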

Additional context

The output of pudl_datastore --help (using main) says only that --partitions should be key-value pairs. e.g.

$ pudl_datastore --help                                                                                                                
usage: pudl_datastore [-h] [--dataset DATASET] [--pudl_in PUDL_IN] [--validate] [--sandbox] [--loglevel LOGLEVEL] [--quiet] [--populate-gcs-cache POPULATE_GCS_CACHE]
                      [--partition KEY=VALUE,...]

Download and cache ETL source data from Zenodo.

optional arguments:
  -h, --help            show this help message and exit
...
...
  --partition KEY=VALUE,...
                        Only retrieve resources matching these conditions.
                        
Available Production Datasets:
    - censusdp1tract
...

Compare this to the output of pudl_data --help using pudl v3.2:

$ pudl_data --help
usage: pudl_data [-h] [-q] [-z] [-c] [-d DATASTORE_DIR]
                 [-s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]]
                 [-y YEARS [YEARS ...]] [--no_download]
                 [-t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]]

A CLI for fetching public utility data from reporting agency servers. 
...
...

optional arguments:
  -h, --help            show this help message and exit
...
...
  -s {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...], --sources {eia860,eia861,eia923,epacems,epaipm,ferc1} [{eia860,eia861,eia923,epacems,epaipm,ferc1} ...]
                        List of data sources which should be downloaded.
                        (default: ('eia860', 'eia861', 'eia923', 'epacems',
                        'epaipm', 'ferc1')).
  -y YEARS [YEARS ...], --years YEARS [YEARS ...]
                        List of years for which data should be downloaded.
                        Different data sources have differet valid years. If
                        data is not available for a specified year and data
                        source, it will be ignored. If no years are specified,
                        all available data will be downloaded for all
                        requested data sources.
...
...
  -t {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...], --states {AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} [{AL,AR,AZ,CA,CO,CT,DC,DE,FL,GA,IA,ID,IL,IN,KS,KY,LA,MA,MD,ME,MI,MN,MO,MS,MT,NC,ND,NE,NH,NJ,NM,NV,NY,OH,OK,OR,PA,RI,SC,SD,TN,TX,UT,VA,VT,WA,WI,WV,WY} ...]
                        List of two letter US state abbreviations indicating
                        which states data should be downloaded. Currently only
                        applicable to the EPA's CEMS dataset.

We see pudl_data would list the valid -y (years) and -t (states) values.


p.s. pudl is great – thanks for all your good work!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
rousik commented, Feb 10, 2021

Thanks for the feedback. When I added this flag, I really didn't think much about usability, so this conversation is really useful for gaining insights.

The main purpose of this was to make local development/ETL execution easier by fetching a smaller subset of data to work with (e.g. only a specific year) w/o having to run the ETL first (which would do this download on its first pass).

As it is implemented, you can only pass a single value for each valid partition key, so you can’t really select more than one state or year. If you specify an unknown partition, the filtering logic will simply exclude every resource file (nothing matches).
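To make that behaviour concrete, here is a minimal sketch of single-value matching. It is not the actual pudl code: the function names, the resource names, and the assumption that each resource carries a flat `parts` dict are all illustrative.

```python
from typing import Dict, List


def matches(parts: Dict[str, object], filters: Dict[str, str]) -> bool:
    """True only if every filter key is present in parts with an equal value."""
    return all(
        key in parts and str(parts[key]) == value
        for key, value in filters.items()
    )


def filter_resources(resources: List[dict], filters: Dict[str, str]) -> List[dict]:
    """Keep the resources whose parts satisfy all KEY=VALUE filters."""
    return [r for r in resources if matches(r.get("parts", {}), filters)]


# Hypothetical resources, for illustration only.
resources = [
    {"name": "a.zip", "parts": {"year": 2018}},
    {"name": "b.zip", "parts": {"year": 2019}},
]

print([r["name"] for r in filter_resources(resources, {"year": "2018"})])
print([r["name"] for r in filter_resources(resources, {"state": "CA"})])
```

Under these assumptions the unknown key `state` matches no resource, so the second call returns an empty list rather than raising an error. That is the silent "nothing matches" case described above.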

While we could perhaps infer some of the valid partition keys from constants, the right way to do this would be to fetch the datapackage.json files for all known datasets from Zenodo (either production or sandbox) and grab all keys from resources[].parts. Because this would depend on contacting Zenodo, it’s probably not a good idea to add it to the default help screen.

What about listing this by running pudl_datastore --list-partitions?
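A rough sketch of what such a listing could do once a datapackage.json has been downloaded and parsed: walk `resources[].parts` and collect every key with its observed values. The network fetch is left out, the inlined `datapackage` dict is a made-up example, and `collect_partitions` is a hypothetical helper, not an existing pudl function. It also assumes `parts` is a flat mapping of scalar values.

```python
from collections import defaultdict
from typing import Dict, Set


def collect_partitions(datapackage: dict) -> Dict[str, Set[str]]:
    """Gather every partition key and its observed values from resources[].parts."""
    found = defaultdict(set)
    for resource in datapackage.get("resources", []):
        for key, value in resource.get("parts", {}).items():
            found[key].add(str(value))
    return dict(found)


# Made-up stand-in for a parsed datapackage.json (shape assumed).
datapackage = {
    "resources": [
        {"name": "a.zip", "parts": {"year": 2018}},
        {"name": "b.zip", "parts": {"year": 2019}},
    ]
}

# Print in KEY=VALUE form so the output can be pasted after --partition.
for key, values in sorted(collect_partitions(datapackage).items()):
    for value in sorted(values):
        print(f"{key}={value}")
```

Keeping the collection step separate from the download means the help screen never has to contact Zenodo; only an explicit listing command would.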

0 reactions
nickrobinson251 commented, Feb 23, 2021
