Estimate monthly plant-level fuel prices w/o using the EIA API

Instead of using the EIA API to pull monthly average fuel costs by state and fuel type when the costs of individual fuel deliveries are redacted in the fuel_receipts_costs_eia923 table, calculate those averages ourselves.

Motivation

This change will address several issues:

The EIA API is missing a fair amount of the data anyway. Sometimes whole state-months are missing. It also only contains data for coarse fuel categories (coal, petroleum, natural gas) rather than the specific fuel types.
Relying on the API means asking users to register for an API key and manage environment variables. This is a barrier for many of our less technical users.
Whenever something goes wrong with the API, our CI tests fail, and we can’t work with this data locally. Over time this has been happening more frequently. HTML gets returned instead of JSON, or the network is down.
EIA is discontinuing the v1 API in November, 2022, so our current setup will stop working anyway.
There’s a lot of information in the fuel_receipts_costs_eia923 table, and in the related data about the plants, mines, and suppliers involved. Given all that context, it should be possible to estimate the fuel prices fairly well from scratch.

Approach

Estimate fuel prices using a variety of aggregations and use them to fill missing values.
Start with the most granular / accurate and progressively apply less specific estimates until everything is filled in.
Tag each record indicating which estimation was used to fill it in.
Pre-calculate all of the aggregations so that we can look at how they compare with actual values first.
Add each of these aggregations to the original FRC dataframe for plotting.
We should also include the EIA API values for comparison / constraint based on the redacted values.
Looking at the EIA API, only PEL, PC, COW, and NG really have values for $/MMBTU at census region and state level.
Seems like the very granular fuel types only have prices for US Total, at least at the monthly level.
Use median values of the fuel prices in $/MMBTU
Maybe calculate a weighted median? Want typical MMBTU, not typical delivery.
@gschivley and Neha suggested using both spatial and temporal interpolation – averaging prices from the adjacent states, and filling in gaps in the monthly time series when possible.
We could also use a low-effort, but powerful estimator like XGBoost or a random forest to try and incorporate much more information, without designing something bespoke from scratch.
We should be able to benchmark these calculations against the data from the API or the specific information reported in the FRC table by doing some random knockouts to see how well we can recreate the reported values.
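
The random knockout benchmark could be sketched roughly as follows. This is only an illustration, not the implemented method: the function name `knockout_errors`, the `frac` and `seed` parameters, and the price column name `fuel_cost_per_mmbtu` are assumptions, and it uses a plain (unweighted) group median as the estimator.

```python
import numpy as np
import pandas as pd


def knockout_errors(frc, group_cols, frac=0.2, seed=0,
                    price_col="fuel_cost_per_mmbtu"):
    """Relative error of group-median estimates on randomly hidden prices.

    Assumes frc has a unique index. All names here are hypothetical.
    """
    # Only records with a reported price can be knocked out and checked.
    reported = frc.dropna(subset=[price_col])
    held_out = reported.sample(frac=frac, random_state=seed).index

    # Hide the held-out prices, then estimate them from what remains.
    masked = frc[price_col].copy()
    masked.loc[held_out] = np.nan
    est = (
        frc.assign(_masked=masked)
        .groupby(group_cols)["_masked"]
        .transform("median")
    )

    # Compare the estimates against the values that were actually reported.
    actual = frc.loc[held_out, price_col]
    return ((est.loc[held_out] - actual) / actual).dropna()
```

Running this once per candidate aggregation gives one error distribution per aggregation, which can then be plotted or summarized side by side.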

Choosing Aggregations

How do we decide how to prioritize aggregations?
Coal prices don’t vary much month to month, so aggregating annually would have little impact.
Gas & Petroleum prices can vary dramatically month to month, so aggregating across time is bad.
Petroleum fuel prices are highly correlated nationwide, so aggregating geographically has little impact.

Intuitive Aggregation Priorities

Most precise: ["state", "energy_source_code", "report_date"]
Annual aggregation (coal): ["state", "energy_source_code", "report_year"]
Regional aggregation (petroleum): ["census_region", "energy_source_code", "report_date"]
Fuel type aggregation: ["state", "fuel_group_code", "report_date"]
Both regional and fuel type aggregation: ["census_region", "fuel_group_code", "report_date"]
Annual, regional, and fuel type aggregations: ["census_region", "fuel_group_code", "report_year"]
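
A minimal sketch of filling in this priority order, tagging each record with the aggregation that filled it. The labels in `AGGS`, the function name `fill_fuel_cost`, and the price column `fuel_cost_per_mmbtu` are assumptions made for illustration, and a plain median stands in for whatever estimator is finally chosen.

```python
import numpy as np
import pandas as pd

# Hypothetical labels for the aggregations above, most specific first.
AGGS = {
    "state_esc_month": ["state", "energy_source_code", "report_date"],
    "state_esc_year": ["state", "energy_source_code", "report_year"],
    "region_esc_month": ["census_region", "energy_source_code", "report_date"],
    "state_fgc_month": ["state", "fuel_group_code", "report_date"],
    "region_fgc_month": ["census_region", "fuel_group_code", "report_date"],
    "region_fgc_year": ["census_region", "fuel_group_code", "report_year"],
}


def fill_fuel_cost(frc, aggs=AGGS, price_col="fuel_cost_per_mmbtu"):
    """Fill missing prices with progressively less specific group medians."""
    out = frc.copy()
    out["filled_by"] = pd.Series(pd.NA, index=out.index, dtype="object")
    for label, cols in aggs.items():
        # Estimates always come from the originally reported prices in frc,
        # never from values filled in by an earlier, more specific pass.
        est = frc.groupby(cols)[price_col].transform("median")
        # Only touch records that are still missing and can be estimated here.
        mask = out[price_col].isna() & est.notna()
        out.loc[mask, price_col] = est[mask]
        out.loc[mask, "filled_by"] = label
    return out
```

Because the mask only selects records that are still missing, each record keeps the label of the first (most specific) aggregation that could fill it, rather than the label of the last aggregation applied.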

Questions:

Should we use a MMBTU weighted median rather than delivery weighted median?
How should we identify outlier values in the fuel prices which should be replaced? Some are totally whacked.
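
To make both questions concrete, here is a rough sketch of a weighted median and a modified z-score built on it. The function names are hypothetical. The 0.6745 constant and a cutoff of about 3.5 are the conventional choices for the modified z-score, not something decided in this issue. Passing the MMBTU content of each delivery as `weights` gives the MMBTU weighted median, while equal weights give the ordinary per-delivery median.

```python
import numpy as np


def weighted_median(values, weights):
    """Smallest value at which cumulative weight reaches half the total."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Drop missing observations and non-positive weights.
    keep = ~np.isnan(values) & ~np.isnan(weights) & (weights > 0)
    values, weights = values[keep], weights[keep]
    if values.size == 0:
        return np.nan
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return values[idx]


def modified_zscore(values, weights):
    """Modified z-score using a weighted median and weighted MAD.

    If the weighted MAD is zero the result is inf or nan; callers
    would need to handle that case separately.
    """
    values = np.asarray(values, dtype=float)
    med = weighted_median(values, weights)
    mad = weighted_median(np.abs(values - med), weights)
    return 0.6745 * (values - med) / mad
```

For example, with prices `[1, 2, 10]` and MMBTU weights `[1, 1, 10]` the weighted median is 10, while the per-delivery median is 2: one large delivery dominates the typical MMBTU even though it is only one of three deliveries.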

Other Potential Refinements

Automatically fill using aggregations in order of increasing dispersion of the error distribution (e.g. IQR) rather than hard-coding the order based on intuition and eyeballing it.
Calculate the dispersion of the error distribution on an annual basis, rather than across the entire timeline, in case the temporal, fuel type & spatial correlations change over time.
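
One possible way to derive the fill order from the data, sketched under assumptions: `rank_aggregations` is a hypothetical helper, the price column name is assumed, and the errors are computed in-sample, so each record contributes to its own group median. That flatters the most granular groupings, so a knockout-based error would be a fairer input.

```python
import pandas as pd


def rank_aggregations(frc, aggs, price_col="fuel_cost_per_mmbtu"):
    """Order aggregations by the IQR of their relative errors, tightest first.

    aggs maps a label to the list of columns to group by.
    """
    iqr = {}
    for label, cols in aggs.items():
        est = frc.groupby(cols)[price_col].transform("median")
        # Relative error of the group median against each reported price.
        err = ((est - frc[price_col]) / frc[price_col]).dropna()
        iqr[label] = err.quantile(0.75) - err.quantile(0.25)
    return pd.Series(iqr).sort_values()
```

To compute the dispersion on an annual basis, the same function could be applied to each year's records separately, e.g. `{year: rank_aggregations(df, aggs) for year, df in frc.groupby("report_year")}`, assuming a `report_year` column is present.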

Remaining tasks:

Always include plant_state in the fuel_receipts_costs_eia923 output table.
Add the census regions to state mappings into the metadata enums / constants.
Replace the existing roll & fill method in the fuel_receipts_costs_eia923 output routine.
Update tests to work with the new version of frc_eia923
Remove API_KEY_EIA infrastructure from everywhere in the code, so we aren’t unknowingly relying on it.
Make filling in missing fuel prices the default behavior
Fix the filled_by labeling, which currently shows every filled value as national_fgc_year, the last aggregation applied.
Remove fuel_group_code from the fuel_receipts_costs_eia923 table and add it to the energy_sources_eia coding table, and add it back into the output function.
Understand why these changes are apparently affecting output row counts
Pull the fuel price filling out into its own separate function
Understand why merge_date() is removing ~10k frc_eia923 records.
Implement weighted median function to use in filling & identifying outliers
Add weighted median unit tests
Identify outlying fuel prices using modified z-score with MMBTU weighted median
Have @cmgosnell look for weirdness in the results of a new MCOE calculation in an RMI context.
Update release notes
After merging into main remove API_KEY_EIA from the GitHub secrets.

Top GitHub Comments

zaneselvans commented, Jun 17, 2022

@TrentonBush The earlier scatter plots are comparing all the reported data – so the ones where there actually was data in the FRC table, and they’re only being aggregated by [state, month, fuel_group]

The more recent scatter plots are only looking at data points that were not present in the FRC table, and comparing the values which were filled in by our new method (breaking it out into all the different kinds of aggregation used) vs. the API values. So it’s not surprising that the correlation is worse in general.

joshdr83 commented, Jun 16, 2022

Hey sorry, just tuning in! I did a spatial interpolation for fuel prices to average down to the county level for new build estimates in this paper. Is it mostly the non ISO regions that are short of data?
