
Integrate FERC XBRL data into PUDL


Background

After creating tools to translate the FERC XBRL filings into SQLite databases, we decided that the old Visual FoxPro DBF and new XBRL data will need their own independent extract + transform processes. The new data has much more structure and metadata, and will be quite a bit easier to work with than the historical data, so it doesn’t make sense to convert the new data into the old structure just so we can run it through the same old transformations (this is discussed in catalyst-cooperative/pudl#1579).

This means a lot of new code and has precipitated a major refactor of the FERC Form 1 transformations, especially since we plan to take on many additional Form 1 tables beyond the ones we’ve already cleaned up and integrated into the PUDL database.

Now that we have access to the raw XBRL data, we’ve been working on two areas in parallel:

  • Making the XBRL data acquisition & extraction process more robust and automated, and
  • Re-writing the transform functions to accommodate both XBRL and DBF data, and to be more modular and re-usable as we expand our coverage to additional tables.

FERC XBRL Data Acquisition (@zschira)

  • This includes the data scraping and archiving process, up to the point of having versioned Zenodo depositions available through the PUDL Datastore, and regularly updated with minimal human intervention.
  • This work will take place in the pudl-scrapers & pudl-zenodo-storage repositories.

Issues

Pre-extract (@zschira)

  • This includes everything that takes the raw inputs archived on Zenodo and turns them into coherent SQLite databases which we can archive, publish and use as standalone resources.
  • We will also produce detailed machine-readable metadata comparable in detail to what is available in the XBRL.
  • This work will primarily take place in the ferc-xbrl-extractor repository.
  • Updates that impact the main PUDL repo will be reflected on the xbrl_integration branch.

Release Issues

Post-release Issues

Update Existing ETL

Updating our transformations is a mix of software engineering and data wrangling tasks. We want to get the software reasonably stable and documented before involving lots of people working in parallel on unfamiliar data, so we’ve broken this into three phases of work:

Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)

  • Get to the point where the fuel_ferc1 and plants_steam_ferc1 tables are loading into the DB successfully and provide functionality comparable to the old DBF process, extending data coverage through 2021.
  • Work on this phase will branch off of and be merged into xbrl_steam.
  • Unit and Integration tests should exist and pass, at which point we can merge the xbrl_steam branch into xbrl_integration.
  • After this phase, the documentation should be good enough, and the design stable enough, that we can bring in other people to work on additional tables that we already have transforms for and are familiar with.

Issues

  • catalyst-cooperative/pudl#1739
  • catalyst-cooperative/pudl#1706
  • catalyst-cooperative/pudl#1738
  • catalyst-cooperative/pudl#1722
  • catalyst-cooperative/pudl#1707
  • catalyst-cooperative/pudl#1876
  • catalyst-cooperative/pudl#1853
  • catalyst-cooperative/pudl#1878
  • catalyst-cooperative/pudl#1877
  • catalyst-cooperative/pudl#1705
  • Merge catalyst-cooperative/pudl#1721
  • catalyst-cooperative/pudl#1924
  • Merge catalyst-cooperative/pudl#1962

Phase 2: Beta Testing w/ Familiar Tables (#1801)

  • Refactor all of our existing FERC Form 1 transform functions to use the new framework, extending coverage to the 2021 XBRL data.
  • This will include creating additional transform functions and parameterizations as needed to deal with more kinds of tables and data problems not encountered in the fuel_ferc1 and plants_steam_ferc1 tables.
  • Based on feedback from this experience we may also make some changes to the transform framework.
  • Work in this phase will branch off of and merge back into the xbrl_integration branch.
  • When this phase is complete, and all unit & integration tests are passing, we will merge xbrl_integration into dev in the PUDL repository, and can make an initial release of this data publicly.

Issues

Phase 3: Integrate New FERC 1 Data & Methods

  • Once we have the 2021 data integrated and working as well as the 2020 data, we’ll move on to expanding coverage to other tables, in both the earlier DBF data and the new XBRL data.
  • At this point the software design will hopefully be stable and able to deal with whatever new problems we encounter.
  • We will also start integrating new data cleaning methods that we haven’t previously employed in our published data.
  • Work in this phase will branch off of and merge into dev.
  • Which tables and data cleaning methods to prioritize will be guided by @arengel & @jrea-rmi, along the lines of #1568.

Key

  • ⭐ = DBF-XBRL mapping is simple & worth delegating, table is ready to be taken on
  • 🟧 = table is ready to be taken on, but may require new reshaping transforms

Issues

2023 Issues


Top GitHub Comments

bendnorman commented, Jul 6, 2022 (2 reactions)

Some more thoughts on Dagster:

You can create nested graphs in Dagster to allow for some logic hierarchy. @zaneselvans and I envision incrementally applying dagster graphs to our ETL. There are multiple levels we’ve identified:

  1. Turn our _etl_{dataset}() functions into ops and construct a graph. This is a very simple DAG, but it would enable us to run the ETLs in separate processes (sketched after this list).
  2. Turn our ETL functions into graphs where the ops are the extract and transform steps.
  3. ETL functions are graphs, E and T steps are graphs and individual table transforms are ops.
  4. ETL functions are graphs, E and T steps are graphs, individual table transforms are graphs and individual reusable cleaning functions are ops.
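
For concreteness, here is a minimal sketch of level 1, using hypothetical etl_ferc1() and etl_eia() ops as stand-ins for the real _etl_{dataset}() functions:

from dagster import job, op

# Hypothetical stand-ins for the real _etl_{dataset}() functions.
@op
def etl_ferc1():
    # ... extract, transform, and load FERC Form 1 here ...
    return "ferc1"

@op
def etl_eia():
    # ... extract, transform, and load EIA here ...
    return "eia"

@job
def pudl_etl():
    # The two ops share no data dependencies, so when the job runs
    # under the default multiprocess executor they can execute in
    # separate processes.
    etl_ferc1()
    etl_eia()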

When we’re writing XBRL functions we don’t need to think much about the first two levels because we are only making changes within the ferc1.transform() function.

Option 3

I wrote up a little prototype for option 3. It’s ok but it seems awkward to have to define all of the transformed tables in the transform graph’s outs param:

https://github.com/catalyst-cooperative/pudl/blob/aa36c91766e70eb3a178232655f7c79fb60b6a87/notebooks/work-in-progress/ferc_dagster_prototype.py#L25-L33

This is the recommended method for returning multiple outputs from graphs. I’m curious if it is possible for graphs to treat dictionaries as a single output instead of multiple.
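
For reference, the awkward pattern looks roughly like this (a minimal sketch with hypothetical fuel_ferc1 and plants_steam_ferc1 ops, not the linked prototype itself):

from dagster import GraphOut, graph, op

# Hypothetical table-transform ops.
@op
def fuel_ferc1(raw_dfs):
    return raw_dfs["fuel"]

@op
def plants_steam_ferc1(raw_dfs):
    return raw_dfs["steam"]

# Every transformed table has to be declared in the graph's outs,
# which is the awkward part noted above.
@graph(out={"fuel_ferc1": GraphOut(), "plants_steam_ferc1": GraphOut()})
def transform(raw_dfs):
    return {
        "fuel_ferc1": fuel_ferc1(raw_dfs),
        "plants_steam_ferc1": plants_steam_ferc1(raw_dfs),
    }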

Satisfying option 3 shouldn’t be that difficult because we can use the existing structure in transform.ferc1. I do have a couple of questions:

  • Do the table transform functions depend on previously cleaned tables? It looks like plants_steam() depends on the output of fuel_ferc1() for plant_id assignment (see the sketch after this list).
  • Do the transform functions depend on multiple tables in ferc1_raw_dfs? This might not be the case in Form 1 land but I’m pretty sure final EIA transform functions rely on multiple raw tables to produce a single cleaned table.
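
If transforms do depend on each other’s outputs, that is straightforward to express in dagster: the dependency just becomes another op input. A minimal sketch, with hypothetical transform_fuel() and transform_plants_steam() ops:

from dagster import graph, op

# Hypothetical ops; the real transforms live in pudl.transform.ferc1.
@op
def transform_fuel(raw_fuel_df):
    # ... clean the fuel table ...
    return raw_fuel_df

@op
def transform_plants_steam(raw_steam_df, fuel_df):
    # plant_id assignment can consume the already-cleaned fuel table,
    # because the dependency is just an extra op input.
    return raw_steam_df

@graph
def transform(raw_fuel_df, raw_steam_df):
    fuel_df = transform_fuel(raw_fuel_df)
    return transform_plants_steam(raw_steam_df, fuel_df)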

Option 4

Option 4 is a bit trickier because we want generic cleaning functions that can be parameterized for each table but do not have to use dagster config. Dagster recommends using Op factories for this situation. They work but feel a little kludgy. Here is an example of a generic transform op:

import pandas as pd
from dagster import job, op

def rename_columns_factory(
    name="default_name",
    ins=None,
    column_mapping=None,
    **kwargs,
):
    """Create an op that renames the columns of a DataFrame.

    Args:
        name (str): The name of the new op.
        ins (Dict[str, In]): Any Ins for the new op. Default: None.
        column_mapping (Dict[str, str]): Mapping of old to new column names.

    Returns:
        function: The new op.
    """

    @op(name=name, ins=ins, **kwargs)
    def rename_df(context, df):
        context.log.info(f"\n The DataFrame: {df}\n")
        context.log.info(f"\n The Op Ins: {context.op_def.ins}\n")
        t_df = df.rename(columns=column_mapping)
        context.log.info(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df

@op
def extract():
    return pd.DataFrame([1,2], columns=["col"])

@job()
def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)

etl.execute_in_process()

rename_columns_factory() parametrizes the inner function rename_df(), which is an op. It’s kind of mind-bending because there is a lot of function wrapping / decorating going on here. If we like this pattern, this is what a dagster-friendly version without dagster abstractions would look like:

import pandas as pd

def rename_columns_factory(
    column_mapping=None,
):
    """Create a function that renames the columns of a DataFrame.

    Args:
        column_mapping: Dict of column rename mappings.

    Returns:
        function: the rename_df function.
    """
    def rename_df(df):
        print(f"\n The DataFrame: {df}\n")
        t_df = df.rename(columns=column_mapping)
        print(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df

def extract():
    return pd.DataFrame([1,2], columns=["col"])

def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)

etl()

An open question here is where we want to store the transform parameters.
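
One option, purely as an illustration rather than a settled design, is to keep the parameters as plain data keyed by table name, outside of dagster config, and let the factories read them at graph-construction time. The table and column names below are hypothetical:

# Illustrative only: per-table parameters stored as plain data,
# read by op factories like rename_columns_factory() above.
TRANSFORM_PARAMS = {
    "fuel_ferc1": {
        "rename_columns": {"fuel": "fuel_type_code"},
    },
    "plants_steam_ferc1": {
        "rename_columns": {"plant_name": "plant_name_ferc1"},
    },
}

def make_rename_op(table_name):
    """Build a rename op for a table from the stored parameters."""
    return rename_columns_factory(
        name=f"rename_{table_name}",
        column_mapping=TRANSFORM_PARAMS[table_name]["rename_columns"],
    )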

bendnorman commented, Jul 29, 2022 (1 reaction)

I’m kind of cooling off on converting cleaning functions to dagster ops:

  • One reason for converting cleaning functions to dagster ops is to validate data after each cleaning operation using dagster-pandera. How important is this to us? I think unit testing cleaning functions and using dagster-pandera to validate the outputs of table transform ops should be adequate (see the sketch after this list). We also haven’t talked about how to store and access the hundreds of pandera schemas for each cleaning step of every PUDL table.
  • Cleaning functions could be documented in the Dagit UI and parallelized if they are ops. However, a majority of cleaning functions are applied sequentially, so there aren’t many complex dependencies or opportunities to parallelize. It would be nice to view all of the cleaning functions applied to a table in Dagit, but users could also just read the source code. I posted about this design question in the dagster community, and one engineer at Dagster thought it might make more sense to keep our cleaning functions as pure python functions or methods.
  • It seems like many contributors want to tweak how tables are transformed. If we kept the cleaning functions as pure python, these contributors wouldn’t need to learn dagster concepts. However, people contributing new datasets or tables would still need to wrap functions in dagster abstractions.
  • Storing transform metadata and logic in classes could reduce coupling. This way, table transform functions and cleaning functions don’t need to know about the structure of the metadata.
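
For reference, a minimal sketch of the kind of pandera validation being discussed, with a hypothetical two-column schema (dagster-pandera wires this same kind of schema into op outputs):

import pandas as pd
import pandera as pa

# Hypothetical schema; a real PUDL table would have many more
# columns and checks.
fuel_ferc1_schema = pa.DataFrameSchema(
    {
        "fuel_type_code": pa.Column(str),
        "fuel_mmbtu_per_unit": pa.Column(float, pa.Check.ge(0)),
    }
)

df = pd.DataFrame(
    {"fuel_type_code": ["coal"], "fuel_mmbtu_per_unit": [20.0]}
)
fuel_ferc1_schema.validate(df)  # raises a SchemaError on failure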