Feature preparation for weather data to use in AI/ML application

I am following an invitation from @floriankrb and Peter Dueben to share some insights into how I tackle some issues when working with weather data and AI.

We do not know yet whether this is helpful, but I think we need well-structured, unified preprocessing for all users, and climetlab could be the place for it.

I see two points where weather data is easily handled in the wrong way:

  • aligning forecast data with measurement/observation/target data
  • structuring data for algorithms that require sequences

I have built solutions for both issues, and I hope you can validate my approach.

Alignment of features

from typing import Tuple
import pandas as pd
COLUMN_DT_FORE = 'dt_fore'

def align_features(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
        """
        Takes both predictors and target values and derives the intersection
        of both to create two matching dataframes by using dt_fore.

        forecast_data contains a MultiIndex with dt_calc, dt_fore, positional_index
        dt_calc: INIT/calculation run timestamp
        dt_fore: leading forecast timestamp
        positional_index: location based indexer

        """
        _target_data = []
        _target_index = []
        _rows_to_take = []
        for dt_fore in forecast_data.index.get_level_values(COLUMN_DT_FORE):
            try:
                _target_data.append(target_data.loc[dt_fore, :].values)
                _target_index.append(dt_fore)
                _rows_to_take.append(True)
            except KeyError:
                _rows_to_take.append(False)

        forecast_features = forecast_data.loc[_rows_to_take, :]
        target = pd.DataFrame(_target_data, index=_target_index)
        return forecast_features, target

Preprocess data according to sequences

This is relevant if you want to use recurrent neural networks such as LSTMs, or convolutional layers.

import pandas as pd

COLUMN_POSITIONAL_INDEX = 'positional_index'
COLUMN_DT_CALC = 'dt_calc'

def pre_process_lstm_dataframe_with_forecast_data(
    data: pd.DataFrame,
    lstm_sequence_length: int,
) -> pd.DataFrame:
    """
    This pre-processing step builds sequences of length lstm_sequence_length for data that contains forecasts.
    A forecast dataset is characterized by a number of dt_calc with several dt_fores for each dt_calc.

    Note: This function requires equally spaced time intervals.

    Args:
          data: pd.DataFrame with MultiIndex
          lstm_sequence_length: historical length of sequence in the dimension of time_frequency

    Returns:
         dataframe with list objects as entries

    """

    def seq(n):
        """generator object to pre process data for use in lstm"""
        df = data.reset_index()
        for g in df.groupby(
            [COLUMN_POSITIONAL_INDEX, COLUMN_DT_CALC], sort=False
        ).rolling(n):
            yield g[data.columns].to_numpy().T if len(g) == n else []

    return pd.DataFrame(
        seq(lstm_sequence_length), index=data.index, columns=data.columns
    ).dropna()

As you can see, I am working with historical point forecasts, but I think this should work for arrays as well. In the end, any 2D data can be transformed into such a DataFrame, although for array data pandas is probably not the best tool. I am pretty sure there are smarter solutions than the ones I am presenting here.
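
For the array case mentioned above, one possible sketch uses NumPy's sliding_window_view instead of pandas rolling. It is not the code from this issue: the function name make_sequences, the (n_windows, sequence_length, n_features) output layout, and the toy columns t2m and wind are assumptions made for illustration. It also assumes that the rows of each forecast run are already sorted by dt_fore and equally spaced.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

GROUP_LEVELS = ["positional_index", "dt_calc"]


def make_sequences(data: pd.DataFrame, sequence_length: int) -> np.ndarray:
    """Return an array of shape (n_windows, sequence_length, n_features)."""
    windows = []
    # Group by location and forecast run so that no window spans two runs.
    for _, run in data.groupby(level=GROUP_LEVELS, sort=False):
        values = run.to_numpy()
        if len(values) < sequence_length:
            continue  # run too short to form a single window
        # The view has shape (n_windows, n_features, sequence_length);
        # swap the last two axes to get time before features.
        view = sliding_window_view(values, sequence_length, axis=0)
        windows.append(view.transpose(0, 2, 1))
    if not windows:
        return np.empty((0, sequence_length, data.shape[1]))
    return np.concatenate(windows, axis=0)


# Toy data (hypothetical): two forecast runs for one location, four lead times each.
runs = pd.to_datetime(["2022-01-01", "2022-01-02"])
index = pd.MultiIndex.from_tuples(
    [(run, run + pd.Timedelta(hours=h), 0) for run in runs for h in range(1, 5)],
    names=["dt_calc", "dt_fore", "positional_index"],
)
toy = pd.DataFrame({"t2m": np.arange(8.0), "wind": np.arange(8.0) * 10}, index=index)

sequences = make_sequences(toy, sequence_length=3)
print(sequences.shape)
```

Each run of four lead times yields two windows of length three, so the toy frame produces four windows in total, and none of them mixes values from different runs.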

From my point of view these are the most important steps and differences to ordinary ML applications. Please let me know what you think about the topic.

Please note that I am working hard to establish our company alitiq, so my time to contribute operational code to climetlab is limited. I will do my best to share knowledge and best practices.

I am really looking forward to discussing this with you.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

3 reactions
JesperDramsch commented, Apr 12, 2022

It still seems to me that a solution without multiple lists, append calls, and nested for-loops (when the positional index comes into play) would be more efficient.

We can get the common dataframe with a join, keeping only the rows that have a target (the equivalent of an inner join):

x = forecast_data.join(target_data, on=[COLUMN_DT_FORE])
x = x[x['target'].notna()]

For the forecast_features you then simply drop the target column:

forecast_features = x.drop(['target'], axis='columns')

The target dataframe additionally needs to drop the MultiIndex levels other than dt_fore, which should be:

out = x.loc[:, ["target"]].reset_index(level=[0,2], drop=True)
out.index.name = None
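
As a hedged, end-to-end illustration of the join idea above, here is a conservative variant that first moves the MultiIndex levels into ordinary columns, so the join key is a plain column, and then restores the index. The helper name align_by_merge and the toy values are made up for this sketch. It assumes that target_data is indexed by timestamp and that its column names do not clash with those of forecast_data.

```python
import pandas as pd

INDEX_LEVELS = ["dt_calc", "dt_fore", "positional_index"]


def align_by_merge(forecast_data: pd.DataFrame, target_data: pd.DataFrame):
    """Keep forecast rows whose dt_fore has an observation; return both, aligned."""
    # Move the index levels into columns so dt_fore is a regular join key.
    flat = forecast_data.reset_index()
    # Inner merge: forecast rows without a matching observation are dropped.
    merged = flat.merge(target_data, left_on="dt_fore", right_index=True, how="inner")
    merged = merged.set_index(INDEX_LEVELS)
    forecast_features = merged.drop(columns=target_data.columns)
    # The target keeps only dt_fore as its index.
    target = merged[target_data.columns].droplevel(["dt_calc", "positional_index"])
    return forecast_features, target


# Toy data (hypothetical): one run with three lead times, observations for two of them.
t0 = pd.Timestamp("2022-01-01")
h1, h2, h3 = (t0 + pd.Timedelta(hours=h) for h in (1, 2, 3))
forecast_data = pd.DataFrame(
    {"t2m": [1.0, 2.0, 3.0]},
    index=pd.MultiIndex.from_tuples(
        [(t0, h1, 0), (t0, h2, 0), (t0, h3, 0)], names=INDEX_LEVELS
    ),
)
target_data = pd.DataFrame({"target": [10.0, 30.0]}, index=[h1, h3])

forecast_features, target = align_by_merge(forecast_data, target_data)
print(forecast_features)
print(target)
```

Because both outputs are cut from the same merged frame, row i of forecast_features always belongs to row i of target.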

I think @floriankrb is better suited to talk about the actual scope of Climetlab and whether processing functions like this should be included, as right now, I understand it more as a data retrieval tool, but I’m just one of the users of it.

1 reaction
JesperDramsch commented, Apr 7, 2022

Hi Daniel,

I’m a colleague of @floriankrb. I can’t comment on the scope of ClimetLab to provide a unified pre-processing interface, but I had a quick look at the code. I think preprocessing, data cleaning, and data assimilation are always important steps in any machine learning application, so it’s great to have examples of this code for others available.

I have one comment on a code smell I noticed. I’m always careful when I see for-loops and .append() statements in pandas, as they tend to be very slow compared to vectorized operations like .apply().

Is there a specific reason that you don’t use the index.intersection() method? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.intersection.html

I can imagine it takes a few steps to figure out with a MultiIndex but overall the vectorization should speed up the calculation significantly. I assume that target_data.index.intersection(forecast_data.index) might already go in a good direction otherwise.
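
One way the intersection idea could look with a MultiIndex is sketched below. It intersects only the dt_fore level with the target index and then uses a boolean mask, so forecast rows from several runs that share a timestamp are all kept. The function name align_by_membership and the toy data are assumptions for illustration, and the sketch assumes that the index of target_data is unique.

```python
import pandas as pd


def align_by_membership(forecast_data, target_data, level="dt_fore"):
    """Vectorized alignment via Index.intersection on one MultiIndex level."""
    dt_fore = forecast_data.index.get_level_values(level)
    # Timestamps present in both the forecasts and the observations.
    common = dt_fore.intersection(target_data.index)
    # True for every forecast row whose dt_fore has an observation.
    mask = dt_fore.isin(common)
    forecast_features = forecast_data[mask]
    # Repeat target rows as needed so both outputs line up row by row.
    target = target_data.loc[dt_fore[mask]]
    return forecast_features, target


# Toy data (hypothetical): two overlapping runs, observations at 02:00 and 03:00.
t0 = pd.Timestamp("2022-01-01 00:00")
t1 = pd.Timestamp("2022-01-01 01:00")
t2 = pd.Timestamp("2022-01-01 02:00")
t3 = pd.Timestamp("2022-01-01 03:00")
forecast_data = pd.DataFrame(
    {"t2m": [1.0, 2.0, 3.0, 4.0]},
    index=pd.MultiIndex.from_tuples(
        [(t0, t1, 0), (t0, t2, 0), (t1, t2, 0), (t1, t3, 0)],
        names=["dt_calc", "dt_fore", "positional_index"],
    ),
)
target_data = pd.DataFrame({"target": [20.0, 30.0]}, index=[t2, t3])

forecast_features, target = align_by_membership(forecast_data, target_data)
print(forecast_features)
print(target)
```

In the toy data the 02:00 observation is matched by two forecast rows, one from each run, so it appears twice in the returned target.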
