Refine generic table transform architecture
We did a first draft of refactoring our transform functions to separate parameters, processes, and data in #1721, #1722, and #1739, focused just on the FERC Form 1 transforms and on integrating the new XBRL data into that process. But the abstractions are more general and should be organized separately from the FERC Form 1 code, so this issue is a place to gather additional refinements and tasks. Maybe it'll become an Epic?
- Require that `AbstractTableTransformer.table_id` correspond to a valid database table which is part of the ETL group.
- Standardize per-table and per-column logging on multi-column transformations so it happens automatically wherever they are deployed.
- Integrate the generic column / multi-column / table transformation functions into methods in the `AbstractTableTransformer` abstract base class.
- Store the data source specific `TransformParams` separate from the data source specific transformers. This could be as JSON on disk under `package_data`, or in a separate module (to keep sub-units reusable, rather than writing them out to disk many times, e.g. a unit conversion that can be used in more than one place).
- Separate the generic transformation infrastructure into its own module(s). This would include:
  - A library of generally useful column transformer functions
  - The multi-column transformer factory function (a sketch of this pattern follows the list)
  - The multi-column transformers that are derived from the column transformers
  - The `AbstractTableTransformer` base class, which will use the above functions in its methods
  - A library of the generally useful `TransformParams` classes we've defined
- Split `transform()` into standard early, mid, and late phases in the ABC.
- Make the fuel bad-row dropper use the parameterized function that Christina defined.
- Separate the fuel bad-row dropper from the total-row dropper (which is fuel specific).
- Figure out why autoreload isn't updating things defined in `__init__.py`.
- Remove `_multicol` from `AbstractTableTransformer` method names – it transforms tables. They're all DataFrames.
- In transform methods, set default arguments to refer to the stored internal params dictionary (but let them be overridden if necessary). It's verbose and redundant as it is now.
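For reference, here is a minimal sketch of the column-to-multi-column factory pattern mentioned above, using illustrative names (`UnitConversion`, `convert_units`, `multicol_transform_factory`) rather than the actual PUDL functions:

```python
from typing import Callable

import pandas as pd
from pydantic import BaseModel


class UnitConversion(BaseModel):
    """Example single-column TransformParams: a linear unit conversion."""

    multiplier: float = 1.0  # defaults produce a no-op
    adder: float = 0.0


def convert_units(col: pd.Series, params: UnitConversion) -> pd.Series:
    """A generic single-column transform function."""
    return col * params.multiplier + params.adder


def multicol_transform_factory(col_func: Callable) -> Callable:
    """Turn a single-column transform into a multi-column transform.

    The returned function takes a dataframe and a dict mapping column names
    to per-column parameter objects.
    """

    def multicol_func(df: pd.DataFrame, params: dict[str, BaseModel]) -> pd.DataFrame:
        for col_name, col_params in params.items():
            if col_name in df.columns:
                df[col_name] = col_func(df[col_name], col_params)
        return df

    return multicol_func


convert_units_multicol = multicol_transform_factory(convert_units)
```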
## Transform Refactor Design Notes
Questions & issues, some pulled from a comment on #1739.
### Implemented

Design issues that were addressed in the initial refactor, as of the merging of PR #1919.
#### Parameterize dropping null-ish rows

- In the FERC data in particular, we sometimes decide to drop rows that have too many NA values in the “data columns.” Can this be parameterized clearly, so we can more easily see and control the criteria on a table-by-table basis?
- This was implemented via the `drop_invalid_rows()` function and the `InvalidRows` model (sketched below).
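For orientation, a minimal sketch of how such a parameter model and row-dropping function might pair up; the field names and logic here are assumptions for illustration, not the actual `InvalidRows` / `drop_invalid_rows()` implementation:

```python
from typing import Optional

import pandas as pd
from pydantic import BaseModel


class InvalidRows(BaseModel):
    """Illustrative parameters for dropping rows with no valid data.

    With the defaults left as None, applying the transform is a no-op.
    """

    required_valid_cols: Optional[list[str]] = None
    invalid_values: Optional[list] = None


def drop_invalid_rows(df: pd.DataFrame, params: InvalidRows) -> pd.DataFrame:
    """Drop rows whose required columns contain no valid data (sketch)."""
    if params.required_valid_cols is None:
        return df  # unparameterized: return the input unchanged
    subset = df[params.required_valid_cols]
    invalid = subset.isna()
    if params.invalid_values is not None:
        invalid = invalid | subset.isin(params.invalid_values)
    # Drop a row only if every one of the required columns is invalid.
    return df[~invalid.all(axis="columns")]
```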
#### Access `self.params` directly by default

- Rather than having the transform methods which depend on `TransformParams` take parameter arguments, they could instead access the `self.params` property directly. Would this be a better arrangement?
- Implemented this with the default behavior that a transform function will look up its own arguments in `self.params` if it doesn't get any explicit arguments passed in. This allows the method calls to be simple and non-repetitive in most cases, but also allows you to override when necessary. E.g. in the case of the XBRL & DBF column renaming, we're using the base `rename_columns()` function, but explicitly feeding it separate parameters depending on whether we're in the XBRL or DBF branch of the transformation. This avoids the need to define separate `rename_columns_xbrl()` and `rename_columns_dbf()` functions, or to override the `rename_columns()` method with a new method that takes a different set of parameters. (A sketch of this fallback pattern follows.)
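A minimal sketch of that fallback pattern, with illustrative class and field names (the real `AbstractTableTransformer` methods differ in the details):

```python
from typing import Optional

import pandas as pd
from pydantic import BaseModel


class RenameColumns(BaseModel):
    """Illustrative params: an old-to-new column name mapping."""

    columns: dict[str, str] = {}


class TableTransformParams(BaseModel):
    """Illustrative per-table container with separate XBRL/DBF rename params."""

    rename_columns: RenameColumns = RenameColumns()
    rename_columns_xbrl: RenameColumns = RenameColumns()
    rename_columns_dbf: RenameColumns = RenameColumns()


class ExampleTableTransformer:
    """Sketch of a transformer whose methods fall back to self.params."""

    def __init__(self, params: TableTransformParams):
        self.params = params

    def rename_columns(
        self, df: pd.DataFrame, params: Optional[RenameColumns] = None
    ) -> pd.DataFrame:
        # No explicit params? Use the ones stored on the transformer.
        if params is None:
            params = self.params.rename_columns
        return df.rename(columns=params.columns)
```

With this arrangement, most call sites are simply `self.rename_columns(df)`, while the XBRL and DBF branches can pass `params=self.params.rename_columns_xbrl` or `params=self.params.rename_columns_dbf` explicitly.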
#### Ensure `TableTransformParams` match `TableTransformer`

Right now the contents of the various `TransformParams` objects are specified in a big dictionary in their own parameter module, which corresponds to a particular data source (e.g. `ferc1`). How do we validate that the `TableTransformParams` object that's either passed into or looked up by a given `TableTransformer` class is actually valid for that class? Can we / should we check that it defines all of the required parameters? And that it doesn't specify any extra parameters which are irrelevant to the class?

This has been handled by defining a generic `TableTransformParams` class that contains the parameter models for all of the transformations associated with the `AbstractTableTransformer`. Each dataset-specific child class then defines a class that inherits from the generic class and adds any additional parameter models that are required for the dataset. When these models are instantiated, Pydantic validates them all and ensures that they are of the expected types, with appropriate values.
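A rough sketch of that inheritance arrangement, with made-up parameter models standing in for the real ones:

```python
from pydantic import BaseModel


class RenameColumns(BaseModel):
    columns: dict[str, str] = {}  # an empty mapping renames nothing


class TableTransformParams(BaseModel):
    """Generic container: one field per generic transform, all defaulting to no-ops."""

    rename_columns: RenameColumns = RenameColumns()


class ConvertUnits(BaseModel):
    """A parameter model only needed by one dataset (illustrative)."""

    multiplier: float = 1.0


class Ferc1TableTransformParams(TableTransformParams):
    """Dataset-specific child class adds its own parameter models."""

    convert_units: ConvertUnits = ConvertUnits()


# Building the object from the big per-table dictionary makes Pydantic validate
# every nested parameter model in one shot.
params = Ferc1TableTransformParams(
    rename_columns={"columns": {"old_name": "new_name"}},
    convert_units={"multiplier": 0.001},
)
```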
Transformations which only apply to a single table (e.g. `standardize_physical_fuel_units()` or `aggregate_duplicate_fuel_types_xbrl()`) don't have externally defined parameter models – they keep that information inside themselves, since it doesn't apply anywhere else and is unique in structure to the problem they are solving.

One tweak that was required to make this work well was ensuring that every `TransformParams` + transform function pair can be specified such that it does nothing in the event that it is left unspecified in the `TableTransformParams` model – the default values are such that the transform can be applied, but without having any effect. For almost all of the existing transformations there was an intuitive way to implement this. The only exception was `drop_invalid_rows()` / `InvalidRows`, which now has a special case of simply returning the input dataframe if all of the parameters for the transformation are `None`.
### Deferred
Stuff that’s worth thinking about, but may or may not get done, and won’t be done now in any case.
#### Coordinating `transform()` and (irregular) inter-table dependencies

What is the best way to pass dataframes (raw and transformed) where they need to go? See discussion in #1574 and prototypes in #1724. It doesn't seem possible to fully standardize, given special dependencies between some tables (e.g. steam needs fuel). The Dagster named inputs/outputs will explicitly declare all dependencies between tables:

- In the coordinating `transform()` we always load all of the tables, so we can be explicit about which raw/transformed tables are fed into each table transform.
- Use explicit per-table arguments, but with some settings checking before each transformation, e.g. `if table_name in ferc1_settings.tables: ...`
- Do a standard thing for all the tables, but with a special case allowing steam to take two arguments. I think we could do something like:
```python
ferc1_tfr_dfs = {}
# Transform all the non-steam tables first:
for table_name in ferc1_settings.tables:
    if table_name == "plants_steam_ferc1":
        continue  # handled separately below, since it also needs the fuel table
    ferc1_tfr_dfs[table_name] = globals().get(table_name)(
        ferc1_dbf_raw_dfs.get(table_name),
        ferc1_xbrl_raw_dfs.get(table_name),
    )
# Make the steam table using fuel:
if "plants_steam_ferc1" in ferc1_settings.tables:
    ferc1_tfr_dfs["plants_steam_ferc1"] = plants_steam_ferc1(
        steam_dbf_raw=ferc1_dbf_raw_dfs.get("plants_steam_ferc1"),
        steam_xbrl_raw=ferc1_xbrl_raw_dfs.get("plants_steam_ferc1"),
        fuel_transformed=ferc1_tfr_dfs.get("fuel_ferc1"),
    )
```
- We still haven't really dealt with this, and it's handled via special cases at the moment. Dagster will deal with this entirely differently. Maybe it's best to just keep the special cases for now and then refactor when we move to Dagster.
#### Distinguishing “data” columns

- The idea of distinguishing “data columns” from other, non-data columns and treating them differently shows up in multiple places. Can we make it explicit and use it as a parameter that controls behavior, rather than hard-coding these lists of columns in functions and methods? These hard-coded lists are often a source of pain when we're adding / removing columns, or renaming things. (A small sketch of what such a parameter could look like follows this list.)
- This seems like a good idea in general, but it's also a deeper design issue – storing more information about the kind of information that's stored in each column (categorical attributes, numerical values that can be summed, numerical values that can't be summed, etc.) is probably better to save for another day.
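If we did parameterize this, it might look something like the sketch below, where the list of data columns becomes an explicit field on a parameter model rather than a hard-coded list inside a function. Everything here (`DropNullData`, `drop_null_data_rows`) is hypothetical:

```python
import pandas as pd
from pydantic import BaseModel


class DropNullData(BaseModel):
    """Hypothetical params making the 'data columns' explicit rather than hard-coded."""

    data_columns: list[str] = []
    max_null_fraction: float = 1.0  # 1.0 means "never drop", i.e. a no-op default


def drop_null_data_rows(df: pd.DataFrame, params: DropNullData) -> pd.DataFrame:
    """Drop rows whose data columns are more than max_null_fraction null (sketch)."""
    if not params.data_columns:
        return df
    null_frac = df[params.data_columns].isna().mean(axis="columns")
    return df[null_frac <= params.max_null_fraction]
```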
#### Transform Function & Parameter Interface

- Currently the column transform functions are defined outside of the `TableTransformer` classes, turned into multi-column transform functions by a factory function, and then wrapped by simple methods that handle per-table logging inside the `TableTransformer` class. This seems a little janky. Is there some way to avoid the need to re-write this boilerplate code for wrapping the transform function with a method and doing the logging? When defining a `TableTransformer` class, is there some way to hand it a list of column & table transform functions and have it build them into methods automatically, including applying the column-to-multicolumn factory function?
- We currently have three `Protocol` classes that define the interfaces for column, multi-column, and table transform functions, including the kind of parameter objects they require (generic `TransformParams` for the column and table transforms, and `MultiColumnTransformParams` for the multi-column transforms). What's the right way to make use of these interfaces? Do they only help with IDE integration? Would it be helpful to have some type linting enabled? If the interface is violated in code right now… does anything actually happen? How should we deal with transformations that require more than one input dataframe (e.g. the FERC plant ID assignment)? Should these functions be allowed to take one or more dataframes? Or should those special transformations simply not be tied to this interface? (A sketch of what such `Protocol` definitions could look like follows this list.)
- How do these `Protocol` definitions relate to the methods that wrap functions which implement the `Protocol`? I guess they also implement it… except that they also take a `self` argument.
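As a reference point, here is a minimal sketch of what such `Protocol` interfaces could look like; the actual definitions in PUDL may differ in names and details:

```python
from typing import Protocol

import pandas as pd
from pydantic import BaseModel


class TransformParams(BaseModel):
    """Base class for all parameter models (illustrative)."""


# Multi-column params: a mapping from column name to single-column params.
MultiColumnTransformParams = dict[str, TransformParams]


class ColumnTransformFunc(Protocol):
    """Interface for functions that transform a single column."""

    def __call__(self, col: pd.Series, params: TransformParams) -> pd.Series:
        ...


class MultiColumnTransformFunc(Protocol):
    """Interface for functions that transform several columns of a dataframe."""

    def __call__(
        self, df: pd.DataFrame, params: MultiColumnTransformParams
    ) -> pd.DataFrame:
        ...


class TableTransformFunc(Protocol):
    """Interface for functions that transform a whole table."""

    def __call__(self, df: pd.DataFrame, params: TransformParams) -> pd.DataFrame:
        ...
```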
#### Implement `DatasetTransformParams` class

- Currently there's no `DatasetTransformParams` model. `TRANSFORM_PARAMS` is just a dictionary keyed by table ID. I imagine there being one of these for each dataset (e.g. ferc1, eia923). Given that the keys are database table IDs, there's potential for some important validations. (See the sketch below for one possible shape.)
- However, I don't think this will be worthwhile until we're actually applying this transform architecture to several datasets. Doing that work will also better inform the design.
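One possible shape for such a model, sketched with assumed names and a hard-coded table list purely for illustration (this class does not exist yet):

```python
from pydantic import BaseModel, validator


class TableTransformParams(BaseModel):
    """Stand-in for the per-table parameter container sketched earlier."""

    rename_columns: dict[str, str] = {}


class DatasetTransformParams(BaseModel):
    """Hypothetical container for all of one dataset's per-table parameters."""

    table_params: dict[str, TableTransformParams]

    @validator("table_params")
    def keys_must_be_known_tables(cls, value):
        # In a real implementation the valid IDs could come from the Package
        # metadata; this hard-coded set is just for illustration.
        known_tables = {"fuel_ferc1", "plants_steam_ferc1"}
        unknown = set(value) - known_tables
        if unknown:
            raise ValueError(f"Unknown table IDs: {unknown}")
        return value
```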
#### Validate Params without Models

I don't like that for the plain-dictionary `TransformParams` we still have to have a single attribute inside the Pydantic class (e.g. the `columns` attribute in `RenameColumns`). We can make the `__root__` of the Pydantic model into a `dict`, but the model doesn't automatically behave like a dict in that case – you still have to add all the dict-like methods to it yourself. `__getitem__()` and `__iter__()` were not sufficient to make it work with `df.rename()`, so I gave up and went back to having a named attribute.

Apparently this will get a lot easier in Pydantic v2, so I've just decided to put it off for now. It shouldn't be too hard to refactor after v2 is out.
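For context, a small sketch of the `__root__` approach and why it falls short in Pydantic v1: even after forwarding a couple of dunder methods, the model still doesn't behave like a real dict, so you end up reaching into `__root__` anyway. Illustrative only:

```python
import pandas as pd
from pydantic import BaseModel


class RenameColumns(BaseModel):
    """Pydantic v1 custom root type: the parameters ARE the dict."""

    __root__: dict[str, str]

    # The model doesn't behave like a dict on its own; you have to forward the
    # mapping protocol yourself, and df.rename() needs more than these two methods.
    def __getitem__(self, key):
        return self.__root__[key]

    def __iter__(self):
        return iter(self.__root__)


params = RenameColumns(__root__={"old": "new"})
df = pd.DataFrame({"old": [1, 2]})
# In practice you still end up unwrapping the root dict explicitly:
df = df.rename(columns=params.__root__)
```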
### Discarded / WontFix
#### Storing `TransformParams` inside `TableTransformer`

The dictionary of `TransformParams` that pertain to a given table could be stored within the `TableTransformer` class itself, rather than as a constant dictionary in a separate external structure (right now it's just getting read in from the constant and stored in `self.params`). We could then have a separate structure which is a collection of all the `TableTransformer` classes that pertain to the same data source. From that collection, you'd be able to compile a complete set of all the `TransformParams`, to see what parameters are being used in the process.

Decided not to do this, since it's useful to be able to pass in `TransformParams` from outside for testing, and for potentially re-using the same `TableTransformer` multiple times with different data and different parameters. It's also nice to be able to store the (extensive) parameterizations elsewhere, so they can be compared with each other and so they don't clutter up the classes themselves, which define the behavior.
#### Constrain and Validate `TableTransformer` construction

The `table_id`, and potentially the `etl_group` or `DataSource`, associated with a `TableTransformer` (or with the collection of all `TableTransformer` classes that are associated with an ETL group) should come from a controlled vocabulary. You can't define a table transformer for a table that's not part of the database. What's the right way to store / access / enforce this? Right now I've hard-coded an Enum in the `pudl.transform.ferc1` module, but that seems wrong. The list of all valid tables, or the particular subset of valid tables associated with an ETL group, can be derived from a `Package`, and we're already using `Package.from_resource_ids()` to enforce the table schema at the end of the `transform()` method. Should the `Package` be an attribute of the `AbstractTableTransformer`? Should there be another layer of classes in here that encompasses all of e.g. the FERC 1 TableTransformers, with that subclass restricted to `table_id` values that are part of the FERC 1 `etl_group`? Enums can also be dynamically constructed, with code like the below:
```python
from enum import Enum

from pudl.metadata.classes import Package

# See also: https://pydantic-docs.helpmanual.io/usage/types/#enums-and-choices
pkg = Package.from_resource_ids()
etl_groups = sorted(set(res.etl_group for res in pkg.resources))
EtlGroup = Enum("EtlGroup", {eg.upper(): eg for eg in etl_groups}, type=str)

def table_id_enum_factory(etl_group: EtlGroup) -> Enum:
    """Build an Enum of the table IDs belonging to one ETL group."""
    pkg = Package.from_resource_ids()
    enum_name = etl_group.name.title().replace("_", "") + "TableId"
    return Enum(
        enum_name,
        {res.name.upper(): res.name for res in pkg.resources if res.etl_group == etl_group},
    )

Ferc1TableId = table_id_enum_factory(etl_group=EtlGroup.FERC1)
```
Decided not to try to do anything like this for now. Tightly coupling what the TableTransformers can be told to do to our particular database structure seems like it would make a mess. Instead, for now, we can just construct the allowed `table_id` values either by hand or dynamically (as suggested above), as appropriate for each data source.
## Top GitHub Comments
I pulled them out here partly because you'd expressed a preference for not doing them in the steam/fuel PR. So I was thinking I would do them in a separate PR after we get the steam/fuel work merged into the `xbrl_integration` branch, but before we dive into doing a bunch of other tables.

I reviewed the disposition of all these items with @cmgosnell and we agreed that they made sense.