Refine generic table transform architecture
We did a first draft of refactoring our transform functions to separate parameters, processes, and data in #1721, #1722, and #1739, focused just on the FERC Form 1 transforms and on integrating the new XBRL data into that process. But the abstractions are more general and should be organized separately from the FERC Form 1 code, so this issue is a place to gather additional refinements and tasks. Maybe it'll become an Epic?
- Require that `AbstractTableTransformer.table_id` correspond to a valid database table which is part of the ETL group.
- Standardize per-table and per-column logging on multi-column transformations so it happens automatically wherever they are deployed.
- Integrate the generic column / multi-column / table transformation functions into methods in the `AbstractTableTransformer` abstract base class.
- Store the data source specific `TransformParams` separate from the data source specific transformers. This could be as JSON on disk under `package_data`, or in a separate module (to keep sub-units reusable, rather than writing them out to disk many times, e.g. a unit conversion that can be used in more than one place).
- Separate the generic transformation infrastructure into its own module(s). This would include:
  - A library of generally useful column transformer functions
  - The multi-column transformer factory function (a sketch of this pattern follows the list)
  - The multi-column transformers that are derived from the column transformers
  - The `AbstractTableTransformer` base class, which will use the above functions in its methods
  - A library of the generally useful `TransformParams` classes we've defined
- Split `transform()` into standard early, mid, and late phases in the ABC.
- Make the fuel bad-row dropper use the parameterized function that Christina defined.
- Separate the fuel bad-row dropper from the total-row dropper (which is fuel specific).
- Figure out why autoreload isn't updating things defined in `__init__.py`.
- Remove `_multicol` from `AbstractTableTransformer` method names – it transforms tables. They're all DataFrames.
- In transform methods, set default arguments to refer to the stored internal params dictionary (but let them be overridden if necessary). It's verbose and redundant as it is now.
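For reference, here is a minimal sketch of the column-to-multi-column factory pattern mentioned above, using illustrative names (`UnitConversion`, `convert_units`, `multicol_transform_factory`) rather than the actual PUDL functions:

```python
from typing import Callable

import pandas as pd
from pydantic import BaseModel


class UnitConversion(BaseModel):
    """Example single-column TransformParams: a linear unit conversion."""

    multiplier: float = 1.0  # defaults produce a no-op
    adder: float = 0.0


def convert_units(col: pd.Series, params: UnitConversion) -> pd.Series:
    """A generic single-column transform function."""
    return col * params.multiplier + params.adder


def multicol_transform_factory(col_func: Callable) -> Callable:
    """Turn a single-column transform into a multi-column transform.

    The returned function takes a dataframe and a dict mapping column names
    to per-column parameter objects.
    """

    def multicol_func(df: pd.DataFrame, params: dict[str, BaseModel]) -> pd.DataFrame:
        for col_name, col_params in params.items():
            if col_name in df.columns:
                df[col_name] = col_func(df[col_name], col_params)
        return df

    return multicol_func


convert_units_multicol = multicol_transform_factory(convert_units)
```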
## Transform Refactor Design Notes
Questions & issues, some pulled from a comment on #1739.
### Implemented

Design issues that were addressed in the initial refactor, as of the merging of PR #1919.
#### Parameterize dropping null-ish rows

- In the FERC data in particular, we sometimes decide to drop rows that have too many NA values in the “data columns.” Can this be parameterized clearly, so we can more easily see and control the criteria on a table-by-table basis?
- This was implemented via the `drop_invalid_rows()` function and the `InvalidRows` model (sketched below).
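For orientation, a minimal sketch of how such a parameter model and row-dropping function might pair up; the field names and logic here are assumptions for illustration, not the actual `InvalidRows` / `drop_invalid_rows()` implementation:

```python
from typing import Optional

import pandas as pd
from pydantic import BaseModel


class InvalidRows(BaseModel):
    """Illustrative parameters for dropping rows with no valid data.

    With the defaults left as None, applying the transform is a no-op.
    """

    required_valid_cols: Optional[list[str]] = None
    invalid_values: Optional[list] = None


def drop_invalid_rows(df: pd.DataFrame, params: InvalidRows) -> pd.DataFrame:
    """Drop rows whose required columns contain no valid data (sketch)."""
    if params.required_valid_cols is None:
        return df  # unparameterized: return the input unchanged
    subset = df[params.required_valid_cols]
    invalid = subset.isna()
    if params.invalid_values is not None:
        invalid = invalid | subset.isin(params.invalid_values)
    # Drop a row only if every one of the required columns is invalid.
    return df[~invalid.all(axis="columns")]
```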
#### Access `self.params` directly by default

- Rather than having the transform methods which depend on `TransformParams` take parameter arguments, they could instead access the `self.params` property directly. Would this be a better arrangement?
- Implemented this with the default behavior that a transform function will look up its own arguments in `self.params` if it doesn't get any explicit arguments passed in. This allows the method calls to be simple and non-repetitive in most cases, but also allows you to override when necessary. E.g. in the case of the XBRL & DBF column renaming, we're using the base `rename_columns()` function, but explicitly feeding it separate parameters depending on whether we're in the XBRL or DBF branch of the transformation. This avoids the need to define separate `rename_columns_xbrl()` and `rename_columns_dbf()` functions, or to override the `rename_columns()` method with a new method that takes a different set of parameters. (A sketch of this fallback pattern follows.)
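A minimal sketch of that fallback pattern, with illustrative class and field names (the real `AbstractTableTransformer` methods differ in the details):

```python
from typing import Optional

import pandas as pd
from pydantic import BaseModel


class RenameColumns(BaseModel):
    """Illustrative params: an old-to-new column name mapping."""

    columns: dict[str, str] = {}


class TableTransformParams(BaseModel):
    """Illustrative per-table container with separate XBRL/DBF rename params."""

    rename_columns: RenameColumns = RenameColumns()
    rename_columns_xbrl: RenameColumns = RenameColumns()
    rename_columns_dbf: RenameColumns = RenameColumns()


class ExampleTableTransformer:
    """Sketch of a transformer whose methods fall back to self.params."""

    def __init__(self, params: TableTransformParams):
        self.params = params

    def rename_columns(
        self, df: pd.DataFrame, params: Optional[RenameColumns] = None
    ) -> pd.DataFrame:
        # No explicit params? Use the ones stored on the transformer.
        if params is None:
            params = self.params.rename_columns
        return df.rename(columns=params.columns)
```

With this arrangement, most call sites are simply `self.rename_columns(df)`, while the XBRL and DBF branches can pass `params=self.params.rename_columns_xbrl` or `params=self.params.rename_columns_dbf` explicitly.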
#### Ensure `TableTransformParams` match `TableTransformer`

Right now the contents of the various `TransformParams` objects are specified in a big dictionary in their own parameter module, which corresponds to a particular data source (e.g. `ferc1`). How do we validate that the `TableTransformParams` object that's either passed into or looked up by a given `TableTransformer` class is actually valid for that class? Can we / should we check that it defines all of the required parameters? And that it doesn't specify any extra parameters which are irrelevant to the class?

This has been handled by defining a generic `TableTransformParams` class that contains the parameter models for all of the transformations associated with the `AbstractTableTransformer`. Each dataset-specific child class then defines a class that inherits from the generic class and adds any additional parameter models that are required for the dataset. When these models are instantiated, Pydantic validates them all and ensures that they are of the expected types, with appropriate values.
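A rough sketch of that inheritance arrangement, with made-up parameter models standing in for the real ones:

```python
from pydantic import BaseModel


class RenameColumns(BaseModel):
    columns: dict[str, str] = {}  # an empty mapping renames nothing


class TableTransformParams(BaseModel):
    """Generic container: one field per generic transform, all defaulting to no-ops."""

    rename_columns: RenameColumns = RenameColumns()


class ConvertUnits(BaseModel):
    """A parameter model only needed by one dataset (illustrative)."""

    multiplier: float = 1.0


class Ferc1TableTransformParams(TableTransformParams):
    """Dataset-specific child class adds its own parameter models."""

    convert_units: ConvertUnits = ConvertUnits()


# Building the object from the big per-table dictionary makes Pydantic validate
# every nested parameter model in one shot.
params = Ferc1TableTransformParams(
    rename_columns={"columns": {"old_name": "new_name"}},
    convert_units={"multiplier": 0.001},
)
```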
Transformations which only apply to a single table (e.g. `standardize_physical_fuel_units()` or `aggregate_duplicate_fuel_types_xbrl()`) don't have externally defined parameter models – they keep that information inside themselves, since it doesn't apply anywhere else and is unique in structure to the problem they are solving.

One tweak that was required to make this work well was ensuring that every `TransformParams` + transform function pair can be specified such that it does nothing in the event that it is left unspecified in the `TableTransformParams` model – the default values are such that the transform can be applied, but without having any effect. For almost all of the existing transformations there was an intuitive way to implement this. The only exception was `drop_invalid_rows()` / `InvalidRows`, which now has a special case of simply returning the input dataframe if all of the parameters for the transformation are `None`.
### Deferred
Stuff that’s worth thinking about, but may or may not get done, and won’t be done now in any case.
#### Coordinating `transform()` and (irregular) inter-table dependencies

What is the best way to pass dataframes (raw and transformed) where they need to go? See discussion in #1574 and prototypes in #1724. It doesn't seem possible to fully standardize, given special dependencies between some tables (e.g. steam needs fuel). The Dagster named inputs/outputs will explicitly declare all dependencies between tables:

- In the coordinating `transform()` we always load all of the tables, so we can be explicit about which raw/transformed tables are fed into each table transform.
- Use explicit per-table arguments, but with some settings checking before each transformation, e.g. `if table_name in ferc1_settings.tables: ...`
- Do a standard thing for all the tables, but with a special case allowing steam to take two arguments. I think we could do something like:
```python
ferc1_tfr_dfs = {}
# Transform all the non-steam tables first:
for table_name in ferc1_settings.tables:
    if table_name == "plants_steam_ferc1":
        continue  # handled separately below, since it also needs the fuel table
    ferc1_tfr_dfs[table_name] = globals().get(table_name)(
        ferc1_dbf_raw_dfs.get(table_name),
        ferc1_xbrl_raw_dfs.get(table_name),
    )
# Make the steam table using fuel:
if "plants_steam_ferc1" in ferc1_settings.tables:
    ferc1_tfr_dfs["plants_steam_ferc1"] = plants_steam_ferc1(
        steam_dbf_raw=ferc1_dbf_raw_dfs.get("plants_steam_ferc1"),
        steam_xbrl_raw=ferc1_xbrl_raw_dfs.get("plants_steam_ferc1"),
        fuel_transformed=ferc1_tfr_dfs.get("fuel_ferc1"),
    )
```
- We still haven't really dealt with this, and it's handled via special cases at the moment. Dagster will deal with this entirely differently. Maybe it's best to just keep the special cases for now and then refactor when we move to Dagster.
#### Distinguishing “data” columns

- The idea of distinguishing “data columns” from other, non-data columns and treating them differently shows up in multiple places. Can we make it explicit and use it as a parameter that controls behavior, rather than hard-coding these lists of columns in functions and methods? These hard-coded lists are often a source of pain when we're adding / removing columns, or renaming things. (A small sketch of what such a parameter could look like follows this list.)
- This seems like a good idea in general, but it's also a deeper design issue – storing more information about the kind of information that's stored in each column (categorical attributes, numerical values that can be summed, numerical values that can't be summed, etc.) is probably better to save for another day.
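If we did parameterize this, it might look something like the sketch below, where the list of data columns becomes an explicit field on a parameter model rather than a hard-coded list inside a function. Everything here (`DropNullData`, `drop_null_data_rows`) is hypothetical:

```python
import pandas as pd
from pydantic import BaseModel


class DropNullData(BaseModel):
    """Hypothetical params making the 'data columns' explicit rather than hard-coded."""

    data_columns: list[str] = []
    max_null_fraction: float = 1.0  # 1.0 means "never drop", i.e. a no-op default


def drop_null_data_rows(df: pd.DataFrame, params: DropNullData) -> pd.DataFrame:
    """Drop rows whose data columns are more than max_null_fraction null (sketch)."""
    if not params.data_columns:
        return df
    null_frac = df[params.data_columns].isna().mean(axis="columns")
    return df[null_frac <= params.max_null_fraction]
```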
#### Transform Function & Parameter Interface

- Currently the column transform functions are defined outside of the `TableTransformer` classes, turned into multi-column transform functions by a factory function, and then wrapped by simple methods that handle per-table logging inside the `TableTransformer` class. This seems a little janky. Is there some way to avoid the need to re-write this boilerplate code for wrapping the transform function with a method and doing the logging? When defining a `TableTransformer` class, is there some way to hand it a list of column & table transform functions and have it build them into methods automatically, including applying the column-to-multicolumn factory function?
- We currently have three `Protocol` classes that define the interfaces for column, multi-column, and table transform functions, including the kind of parameter objects they require (generic `TransformParams` for the column and table transforms, and `MultiColumnTransformParams` for the multi-column transforms). What's the right way to make use of these interfaces? Do they only help with IDE integration? Would it be helpful to have some type linting enabled? If the interface is violated in code right now… does anything actually happen? How should we deal with transformations that require more than one input dataframe (e.g. the FERC plant ID assignment)? Should these functions be allowed to take one or more dataframes? Or should those special transformations simply not be tied to this interface? (A sketch of what such `Protocol` definitions could look like follows this list.)
- How do these `Protocol` definitions relate to the methods that wrap functions which implement the `Protocol`? I guess they also implement it… except that they also take a `self` argument.
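As a reference point, here is a minimal sketch of what such `Protocol` interfaces could look like; the actual definitions in PUDL may differ in names and details:

```python
from typing import Protocol

import pandas as pd
from pydantic import BaseModel


class TransformParams(BaseModel):
    """Base class for all parameter models (illustrative)."""


# Multi-column params: a mapping from column name to single-column params.
MultiColumnTransformParams = dict[str, TransformParams]


class ColumnTransformFunc(Protocol):
    """Interface for functions that transform a single column."""

    def __call__(self, col: pd.Series, params: TransformParams) -> pd.Series:
        ...


class MultiColumnTransformFunc(Protocol):
    """Interface for functions that transform several columns of a dataframe."""

    def __call__(
        self, df: pd.DataFrame, params: MultiColumnTransformParams
    ) -> pd.DataFrame:
        ...


class TableTransformFunc(Protocol):
    """Interface for functions that transform a whole table."""

    def __call__(self, df: pd.DataFrame, params: TransformParams) -> pd.DataFrame:
        ...
```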
#### Implement `DatasetTransformParams` class

- Currently there's no `DatasetTransformParams` model. `TRANSFORM_PARAMS` is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for some important validations. (See the sketch below for one possible shape.)
- However, I don't think this will be worthwhile until we're actually applying this transform architecture to several datasets. Doing that work will also better inform the design.
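One possible shape for such a model, sketched with assumed names and a hard-coded table list purely for illustration (this class does not exist yet):

```python
from pydantic import BaseModel, validator


class TableTransformParams(BaseModel):
    """Stand-in for the per-table parameter container sketched earlier."""

    rename_columns: dict[str, str] = {}


class DatasetTransformParams(BaseModel):
    """Hypothetical container for all of one dataset's per-table parameters."""

    table_params: dict[str, TableTransformParams]

    @validator("table_params")
    def keys_must_be_known_tables(cls, value):
        # In a real implementation the valid IDs could come from the Package
        # metadata; this hard-coded set is just for illustration.
        known_tables = {"fuel_ferc1", "plants_steam_ferc1"}
        unknown = set(value) - known_tables
        if unknown:
            raise ValueError(f"Unknown table IDs: {unknown}")
        return value
```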
#### Validate Params without Models

I don't like that for the plain-dictionary `TransformParams` we still have to have a single attribute inside the Pydantic class (e.g. the `columns` attribute in `RenameColumns`). We can make the `__root__` of the Pydantic model into a `dict`, but the model doesn't automatically behave like a dict in that case – you still have to add all the dict-like methods to it yourself. `__getitem__()` and `__iter__()` were not sufficient to make it work with `df.rename()`, so I gave up and went back to having a named attribute.

Apparently this will get a lot easier in Pydantic v2, so I've just decided to put it off for now. It shouldn't be too hard to refactor after v2 is out.
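For context, a small sketch of the `__root__` approach and why it falls short in Pydantic v1: even after forwarding a couple of dunder methods, the model still doesn't behave like a real dict, so you end up reaching into `__root__` anyway. Illustrative only:

```python
import pandas as pd
from pydantic import BaseModel


class RenameColumns(BaseModel):
    """Pydantic v1 custom root type: the parameters ARE the dict."""

    __root__: dict[str, str]

    # The model doesn't behave like a dict on its own; you have to forward the
    # mapping protocol yourself, and df.rename() needs more than these two methods.
    def __getitem__(self, key):
        return self.__root__[key]

    def __iter__(self):
        return iter(self.__root__)


params = RenameColumns(__root__={"old": "new"})
df = pd.DataFrame({"old": [1, 2]})
# In practice you still end up unwrapping the root dict explicitly:
df = df.rename(columns=params.__root__)
```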
### Discarded / WontFix
#### Storing `TransformParams` inside `TableTransformer`

The dictionary of `TransformParams` that pertain to a given table could be stored within the `TableTransformer` class itself, rather than as a constant dictionary in a separate external structure (right now it's just getting read in from the constant and stored in `self.params`). We could then have a separate structure which is a collection of all the `TableTransformer` classes that pertain to the same data source. From that collection, you'd be able to compile a complete set of all the `TransformParams`, to see what parameters are being used in the process.

Decided not to do this, since it's useful to be able to pass in `TransformParams` from outside for testing, and for potentially re-using the same `TableTransformer` multiple times with different data and different parameters. It's also nice to be able to store the (extensive) parameterizations elsewhere, so they can be compared with each other and so they don't clutter up the classes themselves, which define the behavior.
#### Constrain and Validate `TableTransformer` construction

The `table_id`, and potentially the `etl_group` or `DataSource`, associated with a `TableTransformer` (or with the collection of all `TableTransformer` classes that are associated with an ETL group) should come from a controlled vocabulary. You can't define a table transformer for a table that's not part of the database. What's the right way to store / access / enforce this? Right now I've hard-coded an Enum in the `pudl.transform.ferc1` module, but that seems wrong. The list of all valid tables, or the particular subset of valid tables associated with an ETL group, can be derived from a `Package`, and we're already using `Package.from_resource_ids()` to enforce the table schema at the end of the `transform()` method. Should the `Package` be an attribute of the `AbstractTableTransformer`? Should there be another layer of classes in here that encompasses all of e.g. the FERC 1 TableTransformers, with that subclass restricted to `table_id` values that are part of the FERC 1 `etl_group`? Enums can also be dynamically constructed, with code like the below:
```python
from enum import Enum

from pudl.metadata.classes import Package

# See also: https://pydantic-docs.helpmanual.io/usage/types/#enums-and-choices
pkg = Package.from_resource_ids()
etl_groups = sorted(set(res.etl_group for res in pkg.resources))
EtlGroup = Enum("EtlGroup", {eg.upper(): eg for eg in etl_groups}, type=str)

def table_id_enum_factory(etl_group: EtlGroup) -> Enum:
    """Build an Enum of the table IDs belonging to one ETL group."""
    pkg = Package.from_resource_ids()
    enum_name = etl_group.name.title().replace("_", "") + "TableId"
    return Enum(
        enum_name,
        {res.name.upper(): res.name for res in pkg.resources if res.etl_group == etl_group},
    )

Ferc1TableId = table_id_enum_factory(etl_group=EtlGroup.FERC1)
```
Decided not to try to do anything like this for now. Tightly coupling what the TableTransformers can be told to do to our particular database structure seems like it would make a mess. Instead, for now, we can just construct the allowed `table_id` values either by hand or dynamically (as suggested above), as appropriate for each data source.
## Top GitHub Comments
I pulled them out here partly because you'd expressed a preference for not doing them in the steam/fuel PR. So I was thinking I would do them in a separate PR after we get the steam/fuel work merged into the `xbrl_integration` branch, but before we dive into doing a bunch of other tables.

I reviewed the disposition of all these items with @cmgosnell and we agreed that they made sense.