Feature preparation for weather data to use in AI/ML application
I am following an invitation from @floriankrb and Peter Dueben to share some insights into how I tackle some issues when working with weather data and AI.
We do not know yet whether this is helpful, but I think we need well-structured, unified preprocessing for all users, and climetlab could be the place for it.
The two points where I think you can treat weather data in the wrong way are:
- aligning forecast data with measurement/observation/target data
- structuring data for algorithms that require sequences
For both issues I have built solutions, and I hope you can validate my approach.
Alignment of features
from typing import Tuple

import pandas as pd

COLUMN_DT_FORE = 'dt_fore'


def align_features(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Takes both predictors and target values and derives the intersection
    of both to create two matching dataframes by using dt_fore.

    forecast_data contains a MultiIndex with dt_calc, dt_fore, positional_index
        dt_calc: INIT/calculation run timestamp
        dt_fore: leading forecast timestamp
        positional_index: location based indexer
    """
    _target_data = []
    _target_index = []
    _rows_to_take = []
    # Walk over every forecast row and look up the target at its dt_fore.
    for dt_fore in forecast_data.index.get_level_values(COLUMN_DT_FORE):
        try:
            _target_data.append(target_data.loc[dt_fore, :].values)
            _target_index.append(dt_fore)
            _rows_to_take.append(True)
        except KeyError:
            # No target available for this timestamp: drop the forecast row.
            _rows_to_take.append(False)
    forecast_features = forecast_data.loc[_rows_to_take, :]
    target = pd.DataFrame(_target_data, index=_target_index)
    return forecast_features, target
Preprocess data according to sequences
This topic is relevant if you want to use recurrent neural networks such as LSTMs, or convolutional layers.
import pandas as pd

COLUMN_POSITIONAL_INDEX = 'positional_index'
COLUMN_DT_CALC = 'dt_calc'


def pre_process_lstm_dataframe_with_forecast_data(
    data: pd.DataFrame,
    lstm_sequence_length: int,
) -> pd.DataFrame:
    """
    This pre processing step builds sequences according to the lstm_sequence_length for data that contains forecasts.
    A forecast dataset is characterized by a number of dt_calc with several dt_fores for each dt_calc.
    Note: This function requires equal weighted intervals.

    Args:
        data: pd.DataFrame with MultiIndex
        lstm_sequence_length: historical length of sequence in the dimension of time_frequency

    Returns:
        dataframe with list objects as entries
    """

    def seq(n):
        """generator object to pre process data for use in lstm"""
        df = data.reset_index()
        # Roll a window of n rows within each (location, run) group.
        for g in df.groupby(
            [COLUMN_POSITIONAL_INDEX, COLUMN_DT_CALC], sort=False
        ).rolling(n):
            # Incomplete windows yield an empty list and are dropped below.
            yield g[data.columns].to_numpy().T if len(g) == n else []

    return pd.DataFrame(
        seq(lstm_sequence_length), index=data.index, columns=data.columns
    ).dropna()
As you can see, I am working with historical point forecasts, but I think this should work for arrays as well. In the end, any 2D data can be transformed into such a DataFrame, although for array data pandas is probably not the best tool for this. I am pretty sure there are smarter solutions than the ones I am presenting here.
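For array data, one possible alternative is to build the windows directly in NumPy. The following is only a sketch under assumptions that are not part of the original post: the helper names `build_sequences` and `build_sequences_per_run`, the toy columns `t2m` and `wind`, and the output layout `(n_windows, sequence_length, n_features)` are all illustrative. Note that this layout differs from the function above, which stores one array per DataFrame cell.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def build_sequences(values: np.ndarray, sequence_length: int) -> np.ndarray:
    """Turn (n_steps, n_features) into (n_windows, sequence_length, n_features)."""
    if values.shape[0] < sequence_length:
        # Too few time steps for even one full window.
        return np.empty((0, sequence_length, values.shape[1]), dtype=values.dtype)
    # The window axis is appended last: (n_windows, n_features, sequence_length).
    windows = sliding_window_view(values, sequence_length, axis=0)
    return windows.transpose(0, 2, 1)


def build_sequences_per_run(data: pd.DataFrame, sequence_length: int) -> np.ndarray:
    """Window each (positional_index, dt_calc) group separately so that no
    sequence mixes time steps from different forecast runs or locations."""
    groups = data.groupby(level=["positional_index", "dt_calc"], sort=False)
    return np.concatenate(
        [build_sequences(g.to_numpy(), sequence_length) for _, g in groups], axis=0
    )


# Toy data (hypothetical): 2 runs x 4 lead times x 1 location, 2 features.
index = pd.MultiIndex.from_product(
    [
        pd.to_datetime(["2021-01-01", "2021-01-02"]),
        pd.to_datetime(["2021-01-03", "2021-01-04", "2021-01-05", "2021-01-06"]),
        [0],
    ],
    names=["dt_calc", "dt_fore", "positional_index"],
)
data = pd.DataFrame({"t2m": np.arange(8.0), "wind": np.arange(8.0) * 10}, index=index)

sequences = build_sequences_per_run(data, sequence_length=3)
print(sequences.shape)  # (4, 3, 2): two windows per run, none crossing a run boundary
```

Like the pandas version, this sketch assumes the rows of each run are already sorted by dt_fore and have no gaps; it does not check either condition.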
From my point of view, these are the most important steps and the main differences from ordinary ML applications. Please let me know what you think about the topic.
Please note that I am working hard to establish our company alitiq, so my time to contribute operational code to climetlab is limited. I will do my best to share knowledge and best practices.
I am really looking forward to discussing this with you.
Top GitHub Comments
It still seems to me that a solution without multiple list, append, and nested for-loops (when the positional index comes into play) would be more efficient.
We can get the common dataframe with a simple inner join:
For the forecast_features you then simply drop the target column
and the target dataframe needs to drop the multiindex, but should be:
I think @floriankrb is more suited to talk about the actual scope of Climetlab and whether processing functions like this should be included, as right now, I understand it more as a data retrieval tool, but I’m just one of the users of it.
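A minimal sketch of the inner-join idea from this comment. It is not the commenter's code: the toy data, the feature column `t2m`, and a target column literally named `target` are assumptions made for illustration, as is a target index named `dt_fore`.

```python
import pandas as pd

# Toy forecast data (hypothetical): two runs, one location.
forecast_data = pd.DataFrame(
    {"t2m": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_arrays(
        [
            pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]),
            pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-03", "2021-01-04"]),
            [0, 0, 0, 0],
        ],
        names=["dt_calc", "dt_fore", "positional_index"],
    ),
)

# Toy observations (hypothetical): no value exists for 2021-01-04.
target_data = pd.DataFrame(
    {"target": [10.0, 20.0]},
    index=pd.DatetimeIndex(["2021-01-02", "2021-01-03"], name="dt_fore"),
)

# Inner join on dt_fore keeps only forecast rows that have an observation.
common = (
    forecast_data.reset_index()
    .merge(target_data.reset_index(), on="dt_fore", how="inner")
    .set_index(["dt_calc", "dt_fore", "positional_index"])
)

# Features: everything except the target column.
forecast_features = common.drop(columns=["target"])

# Target: only the target column, indexed by dt_fore alone.
target = common[["target"]].droplevel(["dt_calc", "positional_index"])
```

As in the loop-based version, a dt_fore that appears in several runs shows up several times in the target, once per matching forecast row.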
Hi Daniel,
I’m a colleague of @floriankrb. I can’t comment on the scope of ClimetLab to provide a unified pre-processing interface, but I had a quick look at the code. I think preprocessing, data cleaning, and data assimilation are always important steps in any machine learning application, so it’s great to have examples of this code for others available.
I have one comment on a code smell I noticed. I’m always careful when I see for-loops and .append() statements in pandas, as they tend to be very slow compared to vectorized operations like .apply().
Is there a specific reason that you don’t use the index.intersection() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.intersection.html
I can imagine it takes a few steps to figure out with a MultiIndex, but overall the vectorization should speed up the calculation significantly. I assume that target_data.index.intersection(forecast_data.index) might already go in a good direction otherwise.
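One way the intersection idea could be applied is sketched below. It rests on assumptions not stated in the thread: the toy data and column names are illustrative, and target_data is assumed to have a unique DatetimeIndex. The intersection is taken against the dt_fore level rather than the full MultiIndex, because the target index holds plain timestamps while the forecast index holds tuples.

```python
import pandas as pd

# Toy forecast data (hypothetical): two runs, one location.
forecast_data = pd.DataFrame(
    {"t2m": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_arrays(
        [
            pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"]),
            pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-03", "2021-01-04"]),
            [0, 0, 0, 0],
        ],
        names=["dt_calc", "dt_fore", "positional_index"],
    ),
)

# Toy observations (hypothetical) with a unique timestamp index.
target_data = pd.DataFrame(
    {"target": [10.0, 20.0]},
    index=pd.DatetimeIndex(["2021-01-02", "2021-01-03"], name="dt_fore"),
)

# Forecast valid times, one entry per forecast row (may contain duplicates).
fore_times = forecast_data.index.get_level_values("dt_fore")

# Timestamps that occur in both the forecasts and the observations.
common_times = target_data.index.intersection(fore_times.unique())

# Boolean mask over forecast rows, replacing the try/except loop.
mask = fore_times.isin(common_times)
forecast_features = forecast_data[mask]

# Repeat each observation once per matching forecast row, in the same order.
target = target_data.loc[fore_times[mask]]
```

Both outputs have the same number of rows and the same row order, so they can be passed to a model as matching features and labels.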