feature request: only apply random noise in `RandomAdder` to training data

Currently, `RandomAdder` adds noise to data both at training and at prediction time. This makes predictions non-deterministic, and it offers no clear benefit in most cases I can think of.

I suggest changing the default behaviour of the transformer so that it adds random noise only to the training data, with a constructor flag to optionally add it to the prediction data as well.
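
To make the proposal concrete, here is a minimal sketch of one way the requested behaviour could look. It is not the library's `RandomAdder`: the class name `TrainNoiseAdder` and the parameters `noise_scale`, `noise_at_predict` and `random_state` are hypothetical names chosen for illustration. It assumes the training path goes through `fit_transform`, which is what a scikit-learn pipeline calls on its intermediate steps during `fit`, while prediction goes through `transform`.

```python
import numpy as np


class TrainNoiseAdder:
    """Hypothetical stand-in for RandomAdder, not the library's implementation."""

    def __init__(self, noise_scale=1.0, noise_at_predict=False, random_state=None):
        self.noise_scale = noise_scale
        self.noise_at_predict = noise_at_predict  # the proposed constructor flag
        self.random_state = random_state

    def fit(self, X, y=None):
        # Nothing to learn; kept for API compatibility.
        return self

    def _add_noise(self, X):
        rng = np.random.default_rng(self.random_state)
        X = np.asarray(X, dtype=float)
        return X + rng.normal(0.0, self.noise_scale, size=X.shape)

    def fit_transform(self, X, y=None):
        # Training path: noise is always added here.
        return self.fit(X, y)._add_noise(X)

    def transform(self, X):
        # Prediction path: noise only if explicitly requested.
        if self.noise_at_predict:
            return self._add_noise(X)
        return np.asarray(X, dtype=float)
```

With the default `noise_at_predict=False`, repeated calls to `transform` return the input unchanged, so predictions stay deterministic. The weak spot of this sketch is that calling `fit` followed by `transform` on the training data skips the noise entirely.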

Issue Analytics

Created 5 years ago
Comments: 5 (2 by maintainers)

Top GitHub Comments

MBrouns commented, Mar 26, 2019

So I got asked something similar in today’s training: someone wanted to drop rows with too many missing values from train but not from test, so I toyed around to see if I could find something that would work.

I might have figured out a way, but I’m not sure I like it all that much:

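import hashlib

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class TrainOnlyMixin(BaseEstimator, TransformerMixin):
    # Subclasses are expected to implement transform_train and transform_test.

    def fit(self, X, y=None):
        # Remember a fingerprint of the training dataframe.
        self.df_hash_ = self.hash_df(X)
        return self

    @staticmethod
    def hash_df(df):
        # Hash every row (index included), then digest the row hashes.
        return hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()

    def transform(self, X, y=None):
        # Same content as the frame seen in fit: treat it as training data.
        if self.hash_df(X) == self.df_hash_:
            return self.transform_train(X)
        return self.transform_test(X)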

I basically store a hash of the train dataframe, compare `X` with it in `transform`, and then call `transform_train` or `transform_test`. I think this can be made quite generic and I can’t think of a case where it wouldn’t work. What do you think?
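
As a rough illustration of the hashing idea applied to the row-dropping case mentioned above, the sketch below inlines the same hash comparison in a small standalone class. The class name `DropSparseRowsTrainOnly` and the `max_missing` parameter are hypothetical, and the scikit-learn base classes are left out so the example depends only on pandas.

```python
import hashlib

import pandas as pd


def hash_df(df):
    # Same fingerprint as in the comment above: row hashes, then one digest.
    return hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()


class DropSparseRowsTrainOnly:
    """Hypothetical example: drop sparse rows only for the frame seen in fit."""

    def __init__(self, max_missing=1):
        self.max_missing = max_missing

    def fit(self, X, y=None):
        self.df_hash_ = hash_df(X)
        return self

    def transform(self, X, y=None):
        if hash_df(X) == self.df_hash_:
            return self.transform_train(X)
        return self.transform_test(X)

    def transform_train(self, X):
        # Keep rows with at most max_missing NaN values.
        return X[X.isna().sum(axis=1) <= self.max_missing]

    def transform_test(self, X):
        # Leave unseen data untouched.
        return X


train = pd.DataFrame({"a": [1.0, None, 3.0], "b": [1.0, None, None]})
test = pd.DataFrame({"a": [None, 2.0], "b": [None, 5.0]})

dropper = DropSparseRowsTrainOnly(max_missing=1).fit(train)
train_out = dropper.transform(train)  # the all-NaN row is dropped
test_out = dropper.transform(test)    # all rows are kept
```

Because the comparison is on content, a copy of the training frame takes the train branch as well. One consequence worth keeping in mind: if the exact training frame is passed in again at prediction time, it also gets the train treatment.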

MBrouns commented, Mar 28, 2019

#81