
Integrate FERC XBRL data into PUDL


Background

After creating tools to translate the FERC XBRL filings into SQLite databases, we decided that the old Visual FoxPro DBF and new XBRL data will need their own independent extract + transform processes. The new data has much more structure and metadata, and will be quite a bit easier to work with than the historical data, so it doesn’t make sense to convert the new data into the old structure just so we can run it through the same old transformations (this is discussed in catalyst-cooperative/pudl#1579).

This means a lot of new code and has precipitated a major refactor of the FERC Form 1 transformations, especially since we plan to take on many additional Form 1 tables beyond the ones we’ve already cleaned up and integrated into the PUDL database.

Now that we have access to the raw XBRL data, we’ve been working on two areas in parallel:

  • Making the XBRL data acquisition & extraction process more robust and automated, and
  • Re-writing the transform functions to accommodate both XBRL and DBF data, and to be more modular and re-usable as we expand our coverage to additional tables.

FERC XBRL Data Acquisition (@zschira)

  • This includes the data scraping and archiving process, up to the point of having versioned Zenodo depositions available through the PUDL Datastore, and regularly updated with minimal human intervention.
  • This work will take place in the pudl-scrapers & pudl-zenodo-storage repositories.

Issues

Pre-extract (@zschira)

  • This includes everything that takes the raw inputs archived on Zenodo and turns them into coherent SQLite databases which we can archive, publish and use as standalone resources.
  • We will also produce detailed machine-readable metadata comparable in detail to what is available in the XBRL.
  • This work will primarily take place in the ferc-xbrl-extractor repository.
  • Updates that impact the main PUDL repo will be reflected on the xbrl_integration branch.

Release Issues

Post-release Issues

Update Existing ETL

Updating our transformations is a mix of software engineering and data wrangling tasks. We want to get the software reasonably stable and documented before involving lots of people working in parallel on unfamiliar data, so we’ve broken this into three phases of work:

Phase 1: Software Design and Alpha Testing (@zaneselvans & @cmgosnell)

  • Get to the point where the fuel_ferc1 and plants_steam_ferc1 tables are loading into the DB successfully and provide functionality comparable to the old DBF process, extending data coverage through 2021.
  • Work on this phase will branch off of and be merged into xbrl_steam.
  • Unit and Integration tests should exist and pass, at which point we can merge the xbrl_steam branch into xbrl_integration.
  • After this phase, the documentation should be good enough, and the design stable enough, that we can bring in other people to work on additional tables that we already have transforms for and are familiar with.

Issues

  • catalyst-cooperative/pudl#1739
  • catalyst-cooperative/pudl#1706
  • catalyst-cooperative/pudl#1738
  • catalyst-cooperative/pudl#1722
  • catalyst-cooperative/pudl#1707
  • catalyst-cooperative/pudl#1876
  • catalyst-cooperative/pudl#1853
  • catalyst-cooperative/pudl#1878
  • catalyst-cooperative/pudl#1877
  • catalyst-cooperative/pudl#1705
  • Merge catalyst-cooperative/pudl#1721
  • catalyst-cooperative/pudl#1924
  • Merge catalyst-cooperative/pudl#1962

Phase 2: Beta Testing w/ Familiar Tables (#1801)

  • Refactor all of our existing FERC Form 1 transform functions to use the new framework, extending coverage to the 2021 XBRL data.
  • This will include creating additional transform functions and parameterizations as needed to deal with more kinds of tables and data problems not encountered in the fuel_ferc1 and plants_steam_ferc1 tables.
  • Based on feedback from this experience we may also make some changes to the transform framework.
  • Work in this phase will branch off of and merge back into the xbrl_integration branch.
  • When this phase is complete, and all unit & integration tests are passing, we will merge xbrl_integration into dev in the PUDL repository, and can make an initial release of this data publicly.

Issues

Phase 3: Integrate New FERC 1 Data & Methods

  • Once we have the 2021 data integrated and working as well as the 2020 data, we’ll move on to expanding coverage to other tables, in both the earlier DBF data and the new XBRL data.
  • At this point the software design will hopefully be stable and able to deal with whatever new problems we encounter.
  • We will also start integrating new data cleaning methods that we haven’t previously employed in our published data.
  • Work in this phase will branch off of and merge into dev.
  • Which tables and data cleaning methods to prioritize will be guided by @arengel & @jrea-rmi, along the lines of #1568.

Key

  • ⭐ = DBF-XBRL mapping is simple & worth delegating, table is ready to be taken on
  • 🟧 = table is ready to be taken on, but may require new reshaping transforms

Issues

2023 Issues


Top GitHub Comments

bendnorman commented, Jul 6, 2022 (2 reactions)

Some more thoughts on Dagster:

You can create nested graphs in Dagster to allow for some logic hierarchy. @zaneselvans and I envision incrementally applying dagster graphs to our ETL. There are multiple levels we’ve identified:

  1. Turn our _etl_{dataset}() functions into ops and construct a graph. This is a very simple DAG, but it would enable us to run the ETLs in separate processes (sketched after this list).
  2. Turn our ETL functions into graphs where the ops are the extract and transform steps.
  3. ETL functions are graphs, E and T steps are graphs and individual table transforms are ops.
  4. ETL functions are graphs, E and T steps are graphs, individual table transforms are graphs and individual reusable cleaning functions are ops.
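
For concreteness, here is a minimal sketch of level 1, using hypothetical etl_ferc1() and etl_eia() ops as stand-ins for the real _etl_{dataset}() functions:

from dagster import job, op

# Hypothetical stand-ins for the real _etl_{dataset}() functions.
@op
def etl_ferc1():
    # ... extract, transform, and load FERC Form 1 here ...
    return "ferc1"

@op
def etl_eia():
    # ... extract, transform, and load EIA here ...
    return "eia"

@job
def pudl_etl():
    # The two ops share no data dependencies, so when the job runs
    # under the default multiprocess executor they can execute in
    # separate processes.
    etl_ferc1()
    etl_eia()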

When we’re writing XBRL functions we don’t need to think much about the first two levels because we are only making changes within the ferc1.transform() function.

Option 3

I wrote up a little prototype for option 3. It’s ok but it seems awkward to have to define all of the transformed tables in the transform graph’s outs param:

https://github.com/catalyst-cooperative/pudl/blob/aa36c91766e70eb3a178232655f7c79fb60b6a87/notebooks/work-in-progress/ferc_dagster_prototype.py#L25-L33

This is the recommended method for returning multiple outputs from graphs. I’m curious if it is possible for graphs to treat dictionaries as a single output instead of multiple.
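
For reference, the awkward pattern looks roughly like this (a minimal sketch with hypothetical fuel_ferc1 and plants_steam_ferc1 ops, not the linked prototype itself):

from dagster import GraphOut, graph, op

# Hypothetical table-transform ops.
@op
def fuel_ferc1(raw_dfs):
    return raw_dfs["fuel"]

@op
def plants_steam_ferc1(raw_dfs):
    return raw_dfs["steam"]

# Every transformed table has to be declared in the graph's outs,
# which is the awkward part noted above.
@graph(out={"fuel_ferc1": GraphOut(), "plants_steam_ferc1": GraphOut()})
def transform(raw_dfs):
    return {
        "fuel_ferc1": fuel_ferc1(raw_dfs),
        "plants_steam_ferc1": plants_steam_ferc1(raw_dfs),
    }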

Satisfying option 3 shouldn’t be that difficult because we can use the existing structure in transform.ferc1. I do have a couple of questions:

  • Do the table transform functions depend on previously cleaned tables? It looks like plants_steam() depends on the output of fuel_ferc1() for plant_id assignment (see the sketch after this list).
  • Do the transform functions depend on multiple tables in ferc1_raw_dfs? This might not be the case in Form 1 land but I’m pretty sure final EIA transform functions rely on multiple raw tables to produce a single cleaned table.
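
If transforms do depend on each other’s outputs, that is straightforward to express in dagster: the dependency just becomes another op input. A minimal sketch, with hypothetical transform_fuel() and transform_plants_steam() ops:

from dagster import graph, op

# Hypothetical ops; the real transforms live in pudl.transform.ferc1.
@op
def transform_fuel(raw_fuel_df):
    # ... clean the fuel table ...
    return raw_fuel_df

@op
def transform_plants_steam(raw_steam_df, fuel_df):
    # plant_id assignment can consume the already-cleaned fuel table,
    # because the dependency is just an extra op input.
    return raw_steam_df

@graph
def transform(raw_fuel_df, raw_steam_df):
    fuel_df = transform_fuel(raw_fuel_df)
    return transform_plants_steam(raw_steam_df, fuel_df)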

Option 4

Option 4 is a bit trickier because we want generic cleaning functions that can be parameterized for each table but do not have to use dagster config. Dagster recommends using Op factories for this situation. They work but feel a little kludgy. Here is an example of a generic transform op:

import pandas as pd
from dagster import job, op

def rename_columns_factory(
    name="default_name",
    ins=None,
    column_mapping=None,
    **kwargs,
):
    """Create an op that renames the columns of a DataFrame.

    Args:
        name (str): The name of the new op.
        ins (Dict[str, In]): Any Ins for the new op. Default: None.
        column_mapping (Dict[str, str]): Mapping of old to new column names.

    Returns:
        function: The new op.
    """

    @op(name=name, ins=ins, **kwargs)
    def rename_df(context, df):
        context.log.info(f"\n The DataFrame: {df}\n")
        context.log.info(f"\n The Op Ins: {context.op_def.ins}\n")
        t_df = df.rename(columns=column_mapping)
        context.log.info(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df

@op
def extract():
    return pd.DataFrame([1,2], columns=["col"])

@job()
def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)

etl.execute_in_process()

rename_columns_factory() parametrizes the inner function rename_df(), which is an op. It’s kind of mind-bending because there is a lot of function wrapping / decorating going on here. If we like this pattern, this is what a dagster-friendly version without dagster abstractions would look like:

import pandas as pd

def rename_columns_factory(
    column_mapping=None,
):
    """Create a function that renames the columns of a DataFrame.

    Args:
        column_mapping: Dict of column rename mappings.

    Returns:
        function: the rename_df function.
    """
    def rename_df(df):
        print(f"\n The DataFrame: {df}\n")
        t_df = df.rename(columns=column_mapping)
        print(f"\n The Transformed DataFrame: {t_df}\n")
        return t_df

    return rename_df

def extract():
    return pd.DataFrame([1,2], columns=["col"])

def etl():
    df = extract()
    column_mapping = {"col": "column"}
    transformed_df = rename_columns_factory(column_mapping=column_mapping)(df)

etl()

An open question here is where we want to store the transform parameters.
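
One option, purely as an illustration rather than a settled design, is to keep the parameters as plain data keyed by table name, outside of dagster config, and let the factories read them at graph-construction time. The table and column names below are hypothetical:

# Illustrative only: per-table parameters stored as plain data,
# read by op factories like rename_columns_factory() above.
TRANSFORM_PARAMS = {
    "fuel_ferc1": {
        "rename_columns": {"fuel": "fuel_type_code"},
    },
    "plants_steam_ferc1": {
        "rename_columns": {"plant_name": "plant_name_ferc1"},
    },
}

def make_rename_op(table_name):
    """Build a rename op for a table from the stored parameters."""
    return rename_columns_factory(
        name=f"rename_{table_name}",
        column_mapping=TRANSFORM_PARAMS[table_name]["rename_columns"],
    )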

bendnorman commented, Jul 29, 2022 (1 reaction)

I’m kind of cooling off on converting cleaning functions to dagster ops:

  • One reason for converting cleaning functions to dagster ops is to validate data after each cleaning operation using dagster-pandera. How important is this to us? I think unit testing cleaning functions and using dagster-pandera to validate the outputs of table transform ops should be adequate (see the sketch after this list). We also haven’t talked about how to store and access the hundreds of pandera schemas for each cleaning step of every PUDL table.
  • Cleaning functions could be documented in the Dagit UI and parallelized if they are ops. However, a majority of cleaning functions are applied sequentially, so there aren’t many complex dependencies or opportunities to parallelize. It would be nice to view all of the cleaning functions applied to a table in Dagit, but users could also just read the source code. I posted about this design question in the dagster community, and one engineer at Dagster thought it might make more sense to keep our cleaning functions as pure python functions or methods.
  • It seems like many contributors want to tweak how tables are transformed. If we kept the cleaning functions as pure python, these contributors wouldn’t need to learn dagster concepts. However, people contributing new datasets or tables would still need to wrap functions in dagster abstractions.
  • Storing transform metadata and logic in classes could reduce coupling. This way, table transform functions and cleaning functions don’t need to know about the structure of the metadata.
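
For reference, a minimal sketch of the kind of pandera validation being discussed, with a hypothetical two-column schema (dagster-pandera wires this same kind of schema into op outputs):

import pandas as pd
import pandera as pa

# Hypothetical schema; a real PUDL table would have many more
# columns and checks.
fuel_ferc1_schema = pa.DataFrameSchema(
    {
        "fuel_type_code": pa.Column(str),
        "fuel_mmbtu_per_unit": pa.Column(float, pa.Check.ge(0)),
    }
)

df = pd.DataFrame(
    {"fuel_type_code": ["coal"], "fuel_mmbtu_per_unit": [20.0]}
)
fuel_ferc1_schema.validate(df)  # raises a SchemaError on failure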