Integrate FERC XBRL data into PUDL
Background
After creating tools to translate the FERC XBRL filings into SQLite databases, we decided that the old Visual FoxPro DBF and new XBRL data will need their own independent extract + transform processes. The new data has much more structure and metadata, and will be quite a bit easier to work with than the historical data, so it doesn’t make sense to convert the new data into the old structure just so we can run it through the same old transformations (this is discussed in catalyst-cooperative/pudl#1579).
This means a lot of new code and transformations, and has precipitated a major refactor of the FERC Form 1 transformations, especially since we intend to cover many more Form 1 tables beyond the ones we’ve already cleaned up and integrated into the PUDL database.
Now that we have access to the raw XBRL data, we’ve been working on two areas in parallel:
- Making the XBRL data acquisition & extraction process more robust and automated, and
- Re-writing the transform functions to accommodate both XBRL and DBF data, and to be more modular and re-usable as we expand our coverage to additional tables.
FERC XBRL Data Acquisition (@zschira)
- This includes the data scraping and archiving process, up to the point of having versioned Zenodo depositions available through the PUDL Datastore, and regularly updated with minimal human intervention.
- This work will take place in the pudl-scrapers & pudl-zenodo-storage repositories.
Issues
- catalyst-cooperative/pudl#1593
- Package ferc-xbrl-extractor so it’s installable by PUDL
- https://github.com/catalyst-cooperative/pudl-scrapers/issues/41
- catalyst-cooperative/pudl-scrapers#45
- https://github.com/catalyst-cooperative/pudl-scrapers/issues/42
- https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/19
- https://github.com/catalyst-cooperative/pudl-scrapers/issues/39
- https://github.com/catalyst-cooperative/pudl-scrapers/issues/26
- #1418 (important since XBRL RSS posts individual filings)
Pre-extract (@zschira)
- This includes everything that takes the raw inputs archived on Zenodo and turns them into coherent SQLite databases which we can archive, publish and use as standalone resources.
- We will also produce detailed machine-readable metadata comparable in detail to what is available in the XBRL.
- This work will primarily take place in the ferc-xbrl-extractor repository.
- Updates that impact the main PUDL repo will be reflected on the xbrl_integration branch.
Release Issues
- catalyst-cooperative/pudl#1668
- catalyst-cooperative/pudl#1861
- catalyst-cooperative/pudl#1667
- Release Catalyst packages on conda-forge or isolate pre-extract steps from PUDL repo
- Archive & Publish XBRL derived FERC SQLite DBs catalyst-cooperative/pudl#1830
- https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/17
- catalyst-cooperative/pudl#1860
- Integrate XBRL to SQLite conversion of all FERC forms into the nightly builds
- Automate Datasette redeployment as part of the nightly builds
- #2080
Post-release Issues
Update Existing ETL
Updating our transformations is a mix of software engineering and data wrangling tasks. We want to get the software somewhat stable and documented before involving lots of people working in parallel on unfamiliar data, and so we’ve broken this into 3 phases of work:
Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)
- Get to the point where the fuel_ferc1 and plants_steam_ferc1 tables are loading into the DB successfully, and provide functionality comparable to the old DBF, extending data coverage through 2021.
- Work on this phase will branch off of and be merged into xbrl_steam.
- Unit and Integration tests should exist and pass, at which point we can merge the xbrl_steam branch into xbrl_integration.
- After this phase, documentation should be good enough, and the design stable enough that we can bring in other people to work on additional tables that we have existing transforms for, and familiarity with.
Issues
- catalyst-cooperative/pudl#1739
- catalyst-cooperative/pudl#1706
- catalyst-cooperative/pudl#1738
- catalyst-cooperative/pudl#1722
- catalyst-cooperative/pudl#1707
- catalyst-cooperative/pudl#1876
- catalyst-cooperative/pudl#1853
- catalyst-cooperative/pudl#1878
- catalyst-cooperative/pudl#1877
- catalyst-cooperative/pudl#1705
- Merge catalyst-cooperative/pudl#1721
- catalyst-cooperative/pudl#1924
- Merge catalyst-cooperative/pudl#1962
Phase 2: Beta Testing w/ Familiar Tables (#1801)
- Refactor all of our existing FERC Form 1 transform functions to use the new framework, extending coverage to the 2021 XBRL data.
- This will include creating additional transform functions and parameterizations as needed to deal with more kinds of tables and data problems not encountered in the fuel_ferc1 and plants_steam_ferc1 tables.
- Based on feedback from this experience we may also make some changes to the transform framework.
- Work in this phase will branch off of and merge back into the xbrl_integration branch.
- When this phase is complete, and all unit & integration tests are passing, we will merge xbrl_integration into dev in the PUDL repository, and can make an initial release of this data publicly.
Issues
- #1981 @cmgosnell
- #1801
- catalyst-cooperative/pudl#1802 @cmgosnell
- catalyst-cooperative/pudl#1803 @cmgosnell
- catalyst-cooperative/pudl#1820 @cmgosnell
- catalyst-cooperative/pudl#1735 @aesharpe
- catalyst-cooperative/pudl#1807 @zaneselvans & @cmgosnell
- Merge catalyst-cooperative/pudl#1665
Phase 3: Integrate New FERC 1 Data & Methods
- Now that we’ve got the 2021 data integrated with and working as well as the 2020 data, we’ll move on to expanding coverage to other tables, in both the earlier DBF data and new XBRL data.
- At this point the software design will hopefully be stable and able to deal with whatever new problems we encounter.
- We will also start integrating new data cleaning methods that we haven’t previously employed in our published data.
- Work in this phase will branch off of and merge into dev.
- Which tables and data cleaning to prioritize will be guided by @arengel & @jrea-rmi, along the lines of #1568
Key
- ⭐ = DBF-XBRL mapping is simple & worth delegating, table is ready to be taken on
- 🟧 = table is ready to be taken on, but may require new reshaping transforms
Issues
- catalyst-cooperative/pudl#2040 @zaneselvans & @aesharpe
- #2110 @zaneselvans
- catalyst-cooperative/pudl#2012 @zaneselvans
- catalyst-cooperative/pudl#2021 @zaneselvans
- #2075
- catalyst-cooperative/pudl#2014 @zaneselvans
- #1804
- catalyst-cooperative/pudl#1807 @zaneselvans & @cmgosnell
- #1805 ⭐
- #1806 @cmgosnell
- #1808 (one-to-many, 4 XBRL tables) @zaneselvans
- #1809
- #1812
- #1818
- catalyst-cooperative/pudl#1820 @cmgosnell
- #1819 @cmgosnell
2023 Issues
- #471
- #2076
- catalyst-cooperative/pudl#2016 @zaneselvans & @cmgosnell
- #2074
- catalyst-cooperative/pudl#2015 @zaneselvans & @zschira
- #2066
- catalyst-cooperative/pudl#1980
- Integrate total labeler into small/hydro/pumped transform step @aesharpe
- catalyst-cooperative/pudl#1968 @zaneselvans
- Generalize our per-row data fixes so we can retain the bad / weird rows that have employees, totals, etc. See e.g. catalyst-cooperative/pudl#1825
- Transform other, lower priority RMI tables
Top GitHub Comments
Some more thoughts on Dagster:
You can create nested graphs in Dagster to allow for some logic hierarchy. @zaneselvans and I envision incrementally applying dagster graphs to our ETL. There are multiple levels we’ve identified:
- _etl_{dataset}() functions into ops and construct a graph. This is a very simple DAG but would enable us to run the ETLs in separate processes.
When we’re writing XBRL functions we don’t need to think much about the first two levels because we are only making changes within the ferc1.transform() function.
Option 3
I wrote up a little prototype for option 3. It’s ok but it seems awkward to have to define all of the transformed tables in the transform graph’s outs param:
https://github.com/catalyst-cooperative/pudl/blob/aa36c91766e70eb3a178232655f7c79fb60b6a87/notebooks/work-in-progress/ferc_dagster_prototype.py#L25-L33
This is the recommended method for returning multiple outputs from graphs. I’m curious if it is possible for graphs to treat dictionaries as a single output instead of multiple.
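To make the trade-off concrete, here is a plain-Python sketch (no Dagster imports) of what declaring every output table up front amounts to. It is not the linked prototype: `TRANSFORM_OUTS`, `transform_tables()`, the two toy transforms, and the column names are all hypothetical, and tables are lists of row dicts rather than DataFrames so the snippet runs on the standard library alone.

```python
# Hypothetical table type: a list of row dicts stands in for a DataFrame.
Table = list[dict]

# Every output must be named up front, loosely mirroring a graph's outs param.
TRANSFORM_OUTS = ("fuel_ferc1", "plants_steam_ferc1")


def transform_fuel(raw: Table) -> Table:
    """Toy transform: drop rows that have no fuel quantity."""
    return [row for row in raw if row.get("fuel_qty") is not None]


def transform_plants_steam(raw: Table, fuel: Table) -> Table:
    """Toy transform: keep plants that appear in the cleaned fuel table."""
    fueled = {row["plant_name"] for row in fuel}
    return [row for row in raw if row["plant_name"] in fueled]


def transform_tables(raw_dfs: dict[str, Table]) -> dict[str, Table]:
    """Run all transforms and return one named output per declared table."""
    fuel = transform_fuel(raw_dfs["fuel_ferc1"])
    plants = transform_plants_steam(raw_dfs["plants_steam_ferc1"], fuel)
    outputs = {"fuel_ferc1": fuel, "plants_steam_ferc1": plants}
    # Fail loudly if the declared and returned outputs drift apart.
    if set(outputs) != set(TRANSFORM_OUTS):
        raise ValueError("returned tables do not match TRANSFORM_OUTS")
    return outputs
```

The list of names in `TRANSFORM_OUTS` has to be kept in sync with whatever the function returns, which is the same bookkeeping that feels awkward in the graph version. Treating the whole dictionary as a single output would remove that bookkeeping, at the cost of hiding the individual tables from whatever consumes the result.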
Satisfying option three shouldn’t be that difficult because we can use the existing structure in transform.ferc1. I do have a couple of questions:
- plants_steam() depends on the fuel_ferc1() table for plant_id assignment.
- ferc1_raw_dfs? This might not be the case in Form 1 land but I’m pretty sure final EIA transform functions rely on multiple raw tables to produce a single cleaned table.
Option 4
Option 4 is a bit trickier because we want generic cleaning functions that can be parameterized for each table but do not have to use dagster config. Dagster recommends using Op factories for this situation. They work but feel a little kludgy. Here is an example of a generic transform op:
rename_columns_factory() parametrizes the inner function rename_df() which is an op. It’s kind of mind-bending because there is a lot of function wrapping / decorating going on here. If we like this pattern, this is what a dagster friendly version without dagster abstractions would look like:
An open question here is where we want to store the transform parameters.
I’m kind of cooling off on converting cleaning functions to dagster ops: