Feature preparation for weather data to use in AI/ML application

I am following an invitation from @floriankrb and Peter Dueben to share some insights into how I tackle some issues when working with weather data and AI.

We do not actually know whether this is helpful, but I think we need well-structured and unified preprocessing for all users, and climetlab could be the place for it.

The two points where I think you can treat weather data in the wrong way are:

1. aligning forecast data with measurement/observation/target data
2. structuring data for algorithms that require sequences

For both issues I have built solutions, and I hope you can validate the way I do it.

Alignment of features

from typing import Tuple

import pandas as pd

COLUMN_DT_FORE = 'dt_fore'


def align_features(forecast_data: pd.DataFrame, target_data: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Takes both predictors and target values and derives the intersection
    of both to create two matching dataframes by using dt_fore.

    forecast_data contains a MultiIndex with dt_calc, dt_fore, positional_index
    dt_calc: INIT/calculation run timestamp
    dt_fore: leading forecast timestamp
    positional_index: location based indexer

    """
    _target_data = []
    _target_index = []
    _rows_to_take = []
    # Walk over every forecast row and look up its dt_fore in the target data.
    for dt_fore in forecast_data.index.get_level_values(COLUMN_DT_FORE):
        try:
            _target_data.append(target_data.loc[dt_fore, :].values)
            _target_index.append(dt_fore)
            _rows_to_take.append(True)
        except KeyError:
            # No target value for this timestamp: drop the forecast row.
            _rows_to_take.append(False)

    # Boolean mask keeps only forecast rows that have a matching target.
    forecast_features = forecast_data.loc[_rows_to_take, :]
    target = pd.DataFrame(_target_data, index=_target_index)
    return forecast_features, target
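
To make the expected input structure concrete, here is a small usage sketch on made-up data. The timestamps, the temperature column, and the target values are invented for illustration; the function body is a condensed copy of the one above. It assumes that target_data has a unique timestamp index.

```python
import pandas as pd

COLUMN_DT_FORE = "dt_fore"


def align_features(forecast_data, target_data):
    """Condensed copy of the loop-based alignment shown above."""
    target_rows, target_index, rows_to_take = [], [], []
    for dt_fore in forecast_data.index.get_level_values(COLUMN_DT_FORE):
        try:
            target_rows.append(target_data.loc[dt_fore, :].values)
            target_index.append(dt_fore)
            rows_to_take.append(True)
        except KeyError:
            rows_to_take.append(False)
    forecast_features = forecast_data.loc[rows_to_take, :]
    target = pd.DataFrame(target_rows, index=target_index)
    return forecast_features, target


# Hypothetical toy data: one forecast run, three lead times, one location.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 01:00"), 0),
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 02:00"), 0),
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 03:00"), 0),
    ],
    names=["dt_calc", "dt_fore", "positional_index"],
)
forecast_data = pd.DataFrame({"temperature": [10.0, 11.0, 12.0]}, index=index)

# Observations exist only for 01:00 and 03:00, so the 02:00 forecast is dropped.
target_data = pd.DataFrame(
    {"target": [1.5, 3.5]},
    index=pd.DatetimeIndex(["2022-04-01 01:00", "2022-04-01 03:00"]),
)

forecast_features, target = align_features(forecast_data, target_data)
print(forecast_features)
print(target)
```

Both returned frames have the same number of rows, in the same order, so they can be passed to a model as features and labels.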

Preprocess data according to sequences

This topic is relevant if you would like to use recurrent neural networks such as LSTMs, or convolutional layers.

import pandas as pd

COLUMN_POSITIONAL_INDEX = 'positional_index'
COLUMN_DT_CALC = 'dt_calc'

def pre_process_lstm_dataframe_with_forecast_data(
    data: pd.DataFrame,
    lstm_sequence_length: int,
) -> pd.DataFrame:
    """
    This preprocessing step builds sequences of length lstm_sequence_length for data that contains forecasts.
    A forecast dataset is characterized by a number of dt_calc values, each with several dt_fore values.

    Note: This function requires equally spaced time steps.

    Args:
        data: pd.DataFrame with MultiIndex
        lstm_sequence_length: historical length of sequence, in number of time steps

    Returns:
        dataframe with list objects as entries

    """

    def seq(n):
        """generator object to pre process data for use in lstm"""
        df = data.reset_index()
        for g in df.groupby(
            [COLUMN_POSITIONAL_INDEX, COLUMN_DT_CALC], sort=False
        ).rolling(n):
            yield g[data.columns].to_numpy().T if len(g) == n else []

    return pd.DataFrame(
        seq(lstm_sequence_length), index=data.index, columns=data.columns
    ).dropna()
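
For array-based workflows, the same windowing idea can be sketched without storing arrays inside DataFrame cells. The following is a minimal sketch, not the function above: it uses numpy's sliding_window_view (available from NumPy 1.20) per forecast run and returns a 3D array of shape (windows, time steps, features). The function name build_sequences and the toy columns are assumptions made for illustration, and equally spaced time steps are assumed as well.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

COLUMN_POSITIONAL_INDEX = "positional_index"
COLUMN_DT_CALC = "dt_calc"


def build_sequences(data: pd.DataFrame, sequence_length: int) -> np.ndarray:
    """Return an array of shape (n_windows, sequence_length, n_features).

    Windows never cross the boundary between two forecast runs or locations.
    """
    windows = []
    grouped = data.groupby(
        level=[COLUMN_POSITIONAL_INDEX, COLUMN_DT_CALC], sort=False
    )
    for _, group in grouped:
        values = group.to_numpy()
        if len(values) < sequence_length:
            continue  # this run is too short for a single full window
        # Result shape: (n_windows, n_features, sequence_length)
        view = sliding_window_view(values, sequence_length, axis=0)
        # Reorder to (n_windows, sequence_length, n_features)
        windows.append(np.swapaxes(view, 1, 2))
    if not windows:
        return np.empty((0, sequence_length, data.shape[1]))
    return np.concatenate(windows, axis=0)


# Hypothetical toy data: two forecast runs with four lead times each.
dt_calcs = pd.to_datetime(["2022-04-01 00:00", "2022-04-01 12:00"])
rows = [
    (dt_calc, dt_calc + pd.Timedelta(hours=h), 0)
    for dt_calc in dt_calcs
    for h in range(1, 5)
]
index = pd.MultiIndex.from_tuples(
    rows, names=["dt_calc", "dt_fore", "positional_index"]
)
data = pd.DataFrame(
    {"temperature": np.arange(8.0), "wind_speed": np.arange(8.0) * 10},
    index=index,
)

sequences = build_sequences(data, sequence_length=3)
print(sequences.shape)
```

Each run of four lead times yields two windows of length three, so the toy data produces four windows in total.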

As you can see, I am working with historical point forecasts, but I think this should work for arrays as well. In the end, any 2D data can be transformed into such a DataFrame, although for array data pandas is probably not the best tool for this. I am pretty sure there are smarter solutions than the ones I am presenting here.

From my point of view these are the most important steps and differences to ordinary ML applications. Please let me know what you think about the topic.

Please note that I am working hard to establish our company alitiq, so my time to contribute operational code for climetlab is limited. I will do my best to share knowledge and best practices.

I am really looking forward to discussing this with you.

Issue Analytics

State:
Created a year ago
Comments:7 (1 by maintainers)

Top GitHub Comments

3 reactions

JesperDramsch commented, Apr 12, 2022

It still seems to me that a solution without multiple lists, .append() calls, and nested for-loops (when the positional index comes into play) would be more efficient.

We can get the common dataframe with a simple join, keeping only the rows that have a target value (which is effectively an inner join):

x = forecast_data.join(target_data, on=[COLUMN_DT_FORE])
x = x[x['target'].notna()]

For the forecast_features you then simply drop the target column:

forecast_features = x.drop(['target'], axis='columns')

The target dataframe needs to drop the extra MultiIndex levels, which should work like this:

out = x.loc[:, ["target"]].reset_index(level=[0,2], drop=True)
out.index.name = None
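
Put together, the join-based approach from this comment could look like the following sketch. It assumes that target_data has a single column named target and a unique timestamp index, and that the MultiIndex levels are named dt_calc, dt_fore, and positional_index; the function name align_features_with_join and the toy data are made up for illustration.

```python
import pandas as pd

COLUMN_DT_CALC = "dt_calc"
COLUMN_DT_FORE = "dt_fore"
COLUMN_POSITIONAL_INDEX = "positional_index"
TARGET_COLUMN = "target"  # assumed name of the single target column


def align_features_with_join(forecast_data, target_data):
    """Align forecasts and targets with a join instead of a Python loop."""
    # Left join on the dt_fore index level; unmatched rows get NaN targets.
    joined = forecast_data.join(target_data, on=COLUMN_DT_FORE)
    # Dropping NaN targets turns this into an effective inner join.
    joined = joined[joined[TARGET_COLUMN].notna()]

    forecast_features = joined.drop(columns=[TARGET_COLUMN])
    target = joined[[TARGET_COLUMN]].reset_index(
        level=[COLUMN_DT_CALC, COLUMN_POSITIONAL_INDEX], drop=True
    )
    target.index.name = None
    return forecast_features, target


# Hypothetical toy data: one forecast run, three lead times, one location.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 01:00"), 0),
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 02:00"), 0),
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 03:00"), 0),
    ],
    names=[COLUMN_DT_CALC, COLUMN_DT_FORE, COLUMN_POSITIONAL_INDEX],
)
forecast_data = pd.DataFrame({"temperature": [10.0, 11.0, 12.0]}, index=index)
target_data = pd.DataFrame(
    {TARGET_COLUMN: [1.5, 3.5]},
    index=pd.DatetimeIndex(["2022-04-01 01:00", "2022-04-01 03:00"]),
)

forecast_features, target = align_features_with_join(forecast_data, target_data)
print(forecast_features)
print(target)
```

One caveat of filtering with notna(): timestamps that exist in target_data but carry a NaN value are dropped as well, which differs from the loop-based version.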

I think @floriankrb is more suited to talk about the actual scope of Climetlab and whether processing functions like this should be included, as right now, I understand it more as a data retrieval tool, but I’m just one of the users of it.

1 reaction

JesperDramsch commented, Apr 7, 2022

Hi Daniel,

I’m a colleague of @floriankrb. I can’t comment on the scope of ClimetLab to provide a unified pre-processing interface, but I had a quick look at the code. I think preprocessing, data cleaning, and data assimilation are always important steps in any machine learning application, so it’s great to have examples of this code for others available.

I have one comment on a code smell I noticed. I’m always careful when I see for-loops and .append() statements in pandas, as they tend to be very slow compared to vectorized operations like .apply().

Is there a specific reason that you don’t use the index.intersection() method? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.intersection.html

I can imagine it takes a few steps to figure out with a MultiIndex but overall the vectorization should speed up the calculation significantly. I assume that target_data.index.intersection(forecast_data.index) might already go in a good direction otherwise.
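
One way the intersection idea might be applied to the dt_fore level is sketched below. This is an illustration under assumptions, not tested against the original data: the function name align_with_intersection and the toy data are made up, and target_data is assumed to have a unique timestamp index.

```python
import pandas as pd

COLUMN_DT_FORE = "dt_fore"


def align_with_intersection(forecast_data, target_data):
    """Align forecasts and targets via Index.intersection on dt_fore."""
    dt_fore = forecast_data.index.get_level_values(COLUMN_DT_FORE)
    # Timestamps that occur both as a forecast lead time and as a target.
    common = dt_fore.intersection(target_data.index)
    # Boolean mask over forecast rows, computed without a Python loop.
    mask = dt_fore.isin(common)

    forecast_features = forecast_data[mask]
    # Repeat target rows so they line up one-to-one with the forecast rows.
    target = target_data.loc[dt_fore[mask]]
    return forecast_features, target


# Hypothetical toy data: one forecast run, three lead times, one location.
index = pd.MultiIndex.from_tuples(
    [
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 01:00"), 0),
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 02:00"), 0),
        (pd.Timestamp("2022-04-01 00:00"), pd.Timestamp("2022-04-01 03:00"), 0),
    ],
    names=["dt_calc", COLUMN_DT_FORE, "positional_index"],
)
forecast_data = pd.DataFrame({"temperature": [10.0, 11.0, 12.0]}, index=index)
target_data = pd.DataFrame(
    {"target": [1.5, 3.5]},
    index=pd.DatetimeIndex(["2022-04-01 01:00", "2022-04-01 03:00"]),
)

forecast_features, target = align_with_intersection(forecast_data, target_data)
print(forecast_features)
print(target)
```

Because the mask comes from isin(), the same dt_fore may appear in several forecast runs or locations and each occurrence still receives its target row.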
