Transform `plant_in_srvce` xbrl + dbf
The Plant in Service table is the only “row mapped” table that we’ve already pulled into PUDL. Even though it’s not the highest priority of this type of table, we want to tackle it first so we can learn from it and adapt the new transform process to accommodate it, since there are lots of other tables like this.
- The DBF data is row-oriented, with each row number pertaining to a different FERC account number, or to subtotals and totals of various groups of related FERC accounts, and different columns representing starting & ending balances, additions, transfers, subtractions, etc.
- The XBRL data is column-oriented, with different columns representing different FERC account numbers and the additions/retirements/transfers/etc. This results in more than 400 columns.
- We’ve decided to go with the “tidy” or “long” format for these tables, with each column representing a different quantity, and the rows containing identifying information about that quantity.
For the `plant_in_service` table, this means we’ll end up with 6 columns, which happen to correspond to the structure that we find in the DBF tables, but with static IDs rather than annually varying row numbers for the different FERC Accounts. The columns will be:
- `starting_balance` (XBRL instant)
- `additions` (XBRL duration)
- `retirements` (XBRL duration)
- `adjustments` (XBRL duration)
- `transfers` (XBRL duration)
- `ending_balance` (XBRL instant)
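A minimal sketch of the wide-to-tidy reshaping described above. The `<account>_<quantity>` column naming scheme, the `entity_id` / `report_year` ID columns, and the account labels are all assumptions made for illustration; the real XBRL tables may be named differently, and only two of the six quantities are shown to keep the example short.

```python
import pandas as pd

# Hypothetical wide-format extract: one column per (account, quantity) pair.
wide = pd.DataFrame(
    {
        "entity_id": ["C000001", "C000002"],
        "report_year": [2021, 2021],
        "land_and_land_rights_additions": [10.0, 5.0],
        "land_and_land_rights_retirements": [2.0, 0.0],
        "structures_and_improvements_additions": [7.0, 3.0],
        "structures_and_improvements_retirements": [1.0, 4.0],
    }
)

QUANTITIES = ["additions", "retirements"]
id_cols = ["entity_id", "report_year"]

# Stack every value column into (column name, value) pairs.
long = wide.melt(id_vars=id_cols, var_name="xbrl_column", value_name="value")

# Split each column name into the account label and the reported quantity.
pattern = rf"^(?P<ferc_account_label>.+)_(?P<quantity>{'|'.join(QUANTITIES)})$"
long = pd.concat([long, long["xbrl_column"].str.extract(pattern)], axis=1)

# Pivot the quantities back out so each one becomes its own column.
tidy = (
    long.pivot(
        index=id_cols + ["ferc_account_label"],
        columns="quantity",
        values="value",
    )
    .reset_index()
    .rename_axis(columns=None)
)
print(tidy)
```

The result has one row per utility, year, and account, with one column per quantity, so adding more accounts only adds rows.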
XBRL Taxonomy Metadata
To effectively aggregate the values in the above columns, we need some additional metadata, available from the XBRL Taxonomy:
- The groupings of FERC accounts are stable and applied uniformly in many contexts because they are important for filing taxes appropriately. We want to preserve as much of that structure as possible so that both the individual accounts, and their meaningful groupings can be analyzed.
- We need to take care that the sign conventions for different rows/columns are propagated and standardized. E.g. the `retirements` column is a `credit` while all the others are `debit`, but the convention flips in rows that represent sales of equipment rather than purchases.
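One way the column-level sign convention could be recorded and applied is sketched below. The debit/credit assignments come from the note above; treating debits as +1 and credits as -1 when reconciling balances, and the helper name `net_change`, are assumptions of this sketch. It does not handle the rows where the convention flips.

```python
import pandas as pd

# From the text: retirements are credits, the other duration columns are debits.
COLUMN_BALANCE = {
    "additions": "debit",
    "retirements": "credit",
    "adjustments": "debit",
    "transfers": "debit",
}
# Assumption for this sketch: debits add to the balance, credits subtract.
SIGN = {"debit": 1, "credit": -1}


def net_change(df: pd.DataFrame) -> pd.Series:
    """Sum the duration columns after applying each column's sign."""
    signed = {col: df[col] * SIGN[bal] for col, bal in COLUMN_BALANCE.items()}
    return pd.DataFrame(signed).sum(axis=1)


# Made-up values, chosen so the reported balances reconcile.
pis = pd.DataFrame(
    {
        "starting_balance": [100.0, 50.0],
        "additions": [20.0, 0.0],
        "retirements": [5.0, 10.0],
        "adjustments": [0.0, 1.0],
        "transfers": [-3.0, 0.0],
        "ending_balance": [112.0, 41.0],
    }
)
pis["calculated_ending_balance"] = pis["starting_balance"] + net_change(pis)
print(pis[["ending_balance", "calculated_ending_balance"]])
```

Comparing the calculated and reported ending balances is a cheap check that the signs were applied consistently.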
Table Notes
```python
pis_dbf = pd.read_sql("f1_plant_in_srvce", ferc1_engine)
pis_xbrl_duration = pd.read_sql("electric_plant_in_service_204_duration", ferc1_xbrl_engine)
pis_xbrl_instant = pd.read_sql("electric_plant_in_service_204_instant", ferc1_xbrl_engine)
```
- DBF has start/end balance + add/retire/adjust/transfer, as rows w/ labels accessible in the `f1_row_lit_tbl`.
- XBRL data has legible column names but no account numbers (though they are available in the XBRL taxonomy)
- XBRL Instant has one number for each account or grouping. Turns out these are “end of last year” and “end of this year” balances, which we can transform into `starting_balance` and `ending_balance` in the current year. However, we have to do some reshaping of the instant table to make this work (turning 2 years of 1 group of columns into 1 year of 2 groups of columns).
- XBRL Duration table has columns w/ legible names but no FERC Acct numbers.
- There are almost 500 XBRL columns: ~100 different variables, with 6 variables reported for each one.
- DBF data has a mix of header, subheader, total, subtotal, FERC account and a few other numerical values.
- XBRL seems to have clean naming, but names alone can’t be used to group the categories.
- XBRL has FERC accounts in the metadata.
- Seems like it makes sense to adopt the XBRL column names as the new labels for the old (and variable) DBF row numbers.
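The instant-table reshaping mentioned in the notes (two dated balances becoming a starting and an ending balance for one report year) might look roughly like this. The `entity_id`, `report_year`, and `date` column names and the account labels are assumptions for illustration, not the actual schema.

```python
import pandas as pd

# Hypothetical instant table: one row per reporting date, one column per account.
instant = pd.DataFrame(
    {
        "entity_id": ["C000001", "C000001"],
        "report_year": [2021, 2021],
        "date": pd.to_datetime(["2020-12-31", "2021-12-31"]),
        "land_and_land_rights": [100.0, 112.0],
        "structures_and_improvements": [50.0, 41.0],
    }
)

# A date inside the report year is the ending balance; otherwise it is the
# prior year-end, which serves as this year's starting balance.
is_end = instant["date"].dt.year == instant["report_year"]
instant["balance_type"] = is_end.map(
    {True: "ending_balance", False: "starting_balance"}
)

id_cols = ["entity_id", "report_year"]
balances = (
    instant.drop(columns="date")
    .melt(
        id_vars=id_cols + ["balance_type"],
        var_name="ferc_account_label",
        value_name="balance",
    )
    .pivot(
        index=id_cols + ["ferc_account_label"],
        columns="balance_type",
        values="balance",
    )
    .reset_index()
    .rename_axis(columns=None)
)
# Put the two balance columns in chronological order.
balances = balances[
    id_cols + ["ferc_account_label", "starting_balance", "ending_balance"]
]
print(balances)
```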
Tasks
A bespoke reshaping transformation has been implemented via #2025 but we need some additional metadata to enable all the aggregations, which @cmgosnell has communicated is the next priority for RMI.
- Read XBRL taxonomy JSON into a dataframe, retaining the name, account, calculation, and balance columns.
- Normalize the account column to contain a simple string value.
- Figure out how to select just the relevant XBRL values for the `plant_in_service` table from the larger dataframe.
- Figure out how to / whether we can reshape the wide-format categories into a table of metadata that applies uniformly across the 6 columns we are retaining. (Yes, we can.)
- Implement renaming of instant & duration XBRL tables so they follow a programmatically usable naming convention.
- Compile column sign conventions in a dictionary.
- Fix dev notebook to work with all the renamed columns.
- Fix the overwhelming warnings resulting from `column_rename()`.
- Fix overwhelming warnings from duplicate `record_id` in reshaped tables.
- Fix bad XBRL multi-index construction that is scrambling all the reported values.
- Pull draft metadata extraction functions into module & apply sign conventions in `transform_main()`.
- Split `merge_metadata()` and `apply_sign_conventions()` into two methods.
- Simplify / clarify calculation empty list mess.
- Use just one name for `xbrl_metadata_json`.
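As a rough illustration of the first two tasks (reading the taxonomy metadata into a dataframe and normalizing the account column), here is a sketch. The JSON layout shown is hypothetical: the field names `name`, `account`, `calculation`, and `balance` come from the task list, but how they are nested in the real taxonomy JSON, and the example values, are assumptions.

```python
import json

import pandas as pd

# Hypothetical metadata records; the real taxonomy JSON layout may differ.
raw = json.loads(
    """
    [
      {"name": "land_and_land_rights", "account": ["310"],
       "calculation": [], "balance": "debit", "label": "not retained"},
      {"name": "steam_production_plant", "account": [],
       "calculation": [{"name": "land_and_land_rights", "weight": 1.0}],
       "balance": "debit", "label": "not retained"}
    ]
    """
)

# Retain only the four columns named in the task list.
meta = pd.DataFrame(raw)[["name", "account", "calculation", "balance"]]

# Normalize the list-valued account field to a simple string.
# An empty list becomes an empty string in this sketch.
meta["account"] = meta["account"].apply(lambda accts: ", ".join(accts))
print(meta)
```

Whether missing accounts should become an empty string or a null value is a design choice this sketch does not settle.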
Later tasks
Top GitHub Comments
I’ll be following this. I need to review the PUDL output version of this table, compare to the version we made for the Utility Transition Hub for combining with the balance sheet, and may have suggested edits.
The aggregations I see us doing for the plant in service table are to the technology level, with or without asset retirement costs.
I figure the aggregation by technology is in the XBRL taxonomy, and filtering out the asset retirement costs can be done based on listing the rows to exclude before aggregating.
Everything in the plant in service table is `groupby.sum()`, but I agree that will not work in general for other tables that have minus signs in their calculated fields, without a label or aggregation function. I’m still curious to see a case where a field has a different sign for different aggregations; as far as I’ve seen, each record has a single sign convention.
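The technology-level aggregation discussed in these comments could be sketched as follows. The `technology` column, the account labels, the values, and the helper `by_technology` are all hypothetical; in practice the grouping would presumably come from the XBRL taxonomy rather than a hand-written column.

```python
import pandas as pd

# Hypothetical tidy table with a technology grouping already attached.
pis = pd.DataFrame(
    {
        "ferc_account_label": [
            "steam_land",
            "steam_structures",
            "steam_asset_retirement_costs",
            "hydro_land",
            "hydro_asset_retirement_costs",
        ],
        "technology": ["steam", "steam", "steam", "hydro", "hydro"],
        "ending_balance": [100.0, 250.0, 30.0, 80.0, 5.0],
    }
)

# Rows to drop before aggregating (hypothetical labels).
ARC_ROWS = ["steam_asset_retirement_costs", "hydro_asset_retirement_costs"]


def by_technology(df: pd.DataFrame, exclude=()) -> pd.Series:
    """Sum ending balances per technology, skipping any excluded rows."""
    kept = df[~df["ferc_account_label"].isin(list(exclude))]
    return kept.groupby("technology")["ending_balance"].sum()


with_arc = by_technology(pis)
without_arc = by_technology(pis, exclude=ARC_ROWS)
print(with_arc)
print(without_arc)
```

Listing the excluded rows explicitly keeps the filter separate from the aggregation, so the same sum works with or without asset retirement costs.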