Sklearn-Pandas trained ML pipeline: online prediction on GCP?

Hi Team,

I have a very basic requirement for an sklearn/pandas ML pipeline, but I cannot find a clear answer to it anywhere in the documentation.

Let's say I have an sklearn pipeline built with a pandas DataFrame as input, using pandas functions as transformations. This pipeline will therefore expect a pandas DataFrame at prediction time. Does GCMLE (Google Cloud ML Engine) already have the capability to convert the key-value JSON payload into a pandas DataFrame during online prediction?

Example code:

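# Imports needed by the snippet below. X_train and y_train are assumed to be
# a pandas DataFrame and its labels, already loaded.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import FeatureHasher
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Transformer functions
def select_col_df(df, cols, iscatego):
    # Categorical columns are returned as a Series (input to FeatureHasher);
    # numerical columns as a one-column DataFrame (input to StandardScaler).
    if iscatego:
        return df[cols]
    else:
        return df[[cols]]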

def calc_grminusirbyvpatd(df):
    # New feature: (TOTGRQTY - TOTIRQTY) / VPATD.
    # Note: this also adds the column to the input DataFrame in place.
    df['grminusirbyvpatd'] = (df['TOTGRQTY'] - df['TOTIRQTY']) / df['VPATD']
    return df[['grminusirbyvpatd']]

def calc_difgrirdbytotgrqty(df):
    # New feature: DIFGRIRD / TOTGRQTY per row, or 0 when TOTGRQTY is 0.
    def apply_trans(row):
        if row['TOTGRQTY'] != 0:
            return row['DIFGRIRD'] / row['TOTGRQTY']
        else:
            return 0

    df['difgrirdbytotgrqty'] = df.apply(apply_trans, axis=1)
    return df[['difgrirdbytotgrqty']]

# 1. Hash-encode categorical columns
col_pipe = {}
for c_ in X_train.columns:
    if X_train[c_].dtype == 'object':
        # col_pipe[c_] = Pipeline([
        #     ('column_selector', CatColSelector(key=c_)),
        #     ('column_oh', CustomLabelBinarizer())
        # ])

        col_pipe[c_] = Pipeline([
            ('col_sel', FunctionTransformer(select_col_df,
                                            kw_args={'cols': c_, 'iscatego': True},
                                            validate=False)),
            ('col_hash', FeatureHasher(n_features=10, input_type='string'))
        ])

# 2. Also add numerical columns into the pipeline
for c_ in X_train.columns:
    if X_train[c_].dtype != 'object':
        col_pipe[c_] = Pipeline([
            ('col_sel', FunctionTransformer(select_col_df,
                                            kw_args={'cols': c_, 'iscatego': False},
                                            validate=False)),
            ('std_scaler', StandardScaler())
        ])

# 3. Create a few new columns: grminusirbyvpatd, difgrirdbytotgrqty
col_pipe['grminusirbyvpatd'] = Pipeline([
    ('col_new_1', FunctionTransformer(calc_grminusirbyvpatd, validate=False))
])

col_pipe['difgrirdbytotgrqty'] = Pipeline([
    ('col_new_2', FunctionTransformer(calc_difgrirdbytotgrqty, validate=False))
])

# 4. Combine all features in col_pipe with FeatureUnion
feats = FeatureUnion([
    (col_, col_pipe[col_]) for col_ in col_pipe
])

# 5. Add the ML algorithm to the pipeline
final_pipeline = Pipeline([
    ('features', feats),
    ('classifier', RandomForestClassifier(random_state=42, n_estimators=1000,
                                          oob_score=True, n_jobs=-1, verbose=1)),
])

# 6. Train the ML model
print("Training ML model")
final_pipeline.fit(X_train, y_train)

This seems like a very basic requirement for ML on the cloud. Any pointers?
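For reference, the conversion being asked about can be sketched in a few lines of pandas. This is only an illustration under assumptions: it assumes the request body has the `{"instances": [...]}` shape with one key-value object per row, and the helper name `instances_to_frame` and the sample values are hypothetical. The column names are taken from the example code above.

```python
import json

import pandas as pd


def instances_to_frame(request_body, columns):
    """Turn a JSON prediction request into a DataFrame with a fixed column order."""
    payload = json.loads(request_body)
    frame = pd.DataFrame(payload["instances"])
    # Fail early if the payload lacks a column the pipeline was trained on.
    missing = [c for c in columns if c not in frame.columns]
    if missing:
        raise ValueError("missing columns: %s" % missing)
    return frame[columns]


# Hypothetical request body; the values are made up for illustration.
request_body = json.dumps({
    "instances": [
        {"TOTGRQTY": 10, "TOTIRQTY": 4, "VPATD": 2, "DIFGRIRD": 5},
        {"TOTGRQTY": 0, "TOTIRQTY": 0, "VPATD": 1, "DIFGRIRD": 3},
    ]
})
X_new = instances_to_frame(
    request_body, ["TOTGRQTY", "TOTIRQTY", "VPATD", "DIFGRIRD"]
)
```

A frame built this way could then be passed to `final_pipeline.predict(X_new)`, provided it carries every column the pipeline saw during training.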

Issue Analytics

Created 5 years ago
Comments: 8 (4 by maintainers)

Top GitHub Comments

dizcology commented, Apr 25, 2019

@rafiqhasan

It's been a while, and I am not sure whether you have already found a way around the potential issues of running pandas functions in an sklearn pipeline at prediction time. AI Platform Prediction now offers more flexible ways of deploying sklearn pipelines for prediction: https://cloud.google.com/ml-engine/docs/scikit/custom-prediction-routines
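As a rough illustration of the custom prediction routine approach, a predictor class can build the DataFrame itself before calling the pipeline. This is a minimal sketch rather than the official sample: the class name `PandasPipelinePredictor`, the `COLUMNS` list, and the `model.pkl` file name are assumptions, and the `predict` / `from_path` shape follows the predictor interface described in the linked documentation.

```python
import os
import pickle

import pandas as pd

# Assumed training column order; replace with the columns of X_train.
COLUMNS = ["TOTGRQTY", "TOTIRQTY", "VPATD", "DIFGRIRD"]


class PandasPipelinePredictor(object):
    """Sketch of a predictor that feeds a DataFrame to a fitted pipeline."""

    def __init__(self, model):
        self._model = model

    def predict(self, instances, **kwargs):
        # instances is the decoded "instances" list: one dict per row.
        frame = pd.DataFrame(instances, columns=COLUMNS)
        # Convert the NumPy result to plain Python values for the JSON response.
        return self._model.predict(frame).tolist()

    @classmethod
    def from_path(cls, model_dir):
        # "model.pkl" is a hypothetical file name for the pickled pipeline.
        with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
            model = pickle.load(f)
        return cls(model)
```

Because pickle stores functions by reference, the transformer functions passed to `FunctionTransformer` (for example `select_col_df`) would need to be importable from the deployed package for `from_path` to load the pipeline.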

andrewferlitsch commented, Mar 16, 2020

There has been no response from the submitter for half a year.
