
Sklearn-Pandas trained ML Pipeline - online prediction on GCP?

See original GitHub issue

Hi Team,

I have a very basic requirement for a scikit-learn + pandas ML pipeline, but I am not able to find a clear answer to this problem anywhere in the documentation.

Let’s say I have a scikit-learn pipeline built with a pandas DataFrame as input, with pandas functions as transformations, etc. Obviously this pipeline will also expect a pandas DataFrame during prediction. Does GCMLE already have the capability to convert the key-value-paired JSON payload into a pandas DataFrame during online prediction on CMLE?
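
In other words (a rough sketch only; the column names are borrowed from the example code below), the mismatch is:

# Hypothetical illustration; column names are assumptions taken from the example below.
import pandas as pd

# What CMLE online prediction decodes from a request body such as
# {"instances": [{"TOTGRQTY": 10, "TOTIRQTY": 2, "VPATD": 5, "DIFGRIRD": 8}]}:
instances = [{"TOTGRQTY": 10, "TOTIRQTY": 2, "VPATD": 5, "DIFGRIRD": 8}]

# The pipeline's transformers select columns by name, so the instances have to
# be wrapped in a DataFrame before predict() can be called on them:
X_pred = pd.DataFrame(instances)
# final_pipeline.predict(X_pred)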

Example code:

# Imports for the pipeline components used below
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import FunctionTransformer, StandardScaler
from sklearn.feature_extraction import FeatureHasher
from sklearn.ensemble import RandomForestClassifier

# Transformer functions
def select_col_df(df, cols, iscatego):
    # Return a single column: a Series for categorical (hashing) columns,
    # a one-column DataFrame for numerical (scaling) columns.
    if iscatego:
        return df[cols]
    else:
        return df[[cols]]

def calc_grminusirbyvpatd(df):
    # New feature: (TOTGRQTY - TOTIRQTY) / VPATD
    df['grminusirbyvpatd'] = (df['TOTGRQTY'] - df['TOTIRQTY']) / df['VPATD']
    return df[['grminusirbyvpatd']]

def calc_difgrirdbytotgrqty(df):
    # New feature: DIFGRIRD / TOTGRQTY, guarding against division by zero
    def apply_trans(row):
        if row['TOTGRQTY'] != 0:
            return row['DIFGRIRD'] / row['TOTGRQTY']
        else:
            return 0

    df['difgrirdbytotgrqty'] = df.apply(apply_trans, axis=1)
    return df[['difgrirdbytotgrqty']]

#1. Hash-encode categorical (object dtype) columns
col_pipe = {}
for c_ in X_train.columns:
    if X_train[c_].dtype == 'object':
        col_pipe[c_] = Pipeline([
            ('col_sel', FunctionTransformer(select_col_df,
                                            kw_args={'cols': c_, 'iscatego': True},
                                            validate=False)),
            ('col_hash', FeatureHasher(n_features=10, input_type='string'))
        ])
        
#2. Also add numerical columns to the pipeline
for c_ in X_train.columns:
    if X_train[c_].dtype != 'object':
        col_pipe[c_] = Pipeline([
            ('col_sel', FunctionTransformer(select_col_df,
                                            kw_args={'cols': c_, 'iscatego': False},
                                            validate=False)),
            ('std_scaler', StandardScaler())
        ])

#3. Create a few new columns: "grminusirbyvpatd", "difgrirdbytotgrqty"
col_pipe['grminusirbyvpatd'] = Pipeline([
    ('col_new_1', FunctionTransformer(calc_grminusirbyvpatd, validate=False))
])

col_pipe['difgrirdbytotgrqty'] = Pipeline([
    ('col_new_2', FunctionTransformer(calc_difgrirdbytotgrqty, validate=False))
])

#4. Combine all features in col_pipe{} with FeatureUnion
feats = FeatureUnion([
    (col_, pipe_) for col_, pipe_ in col_pipe.items()
])

#5. Add the ML algorithm to the Pipeline
final_pipeline = Pipeline([
    ('features', feats),
    ('classifier', RandomForestClassifier(random_state=42, n_estimators=1000,
                                          oob_score=True, n_jobs=-1, verbose=1)),
])

#6. Train ML model
print("Training ML model")
final_pipeline.fit(X_train, y_train)
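
For completeness, a short sketch of how the trained pipeline would typically be exported and exercised locally. The file name follows AI Platform's convention of loading model.joblib (or model.pkl) from the model directory; df_new is an assumed DataFrame with the same columns as X_train.

# Sketch: export the trained pipeline for AI Platform / CMLE, which loads the
# model from a file named model.joblib (or model.pkl) in the model directory.
import joblib
joblib.dump(final_pipeline, 'model.joblib')

# Local sanity check (sketch): predict() expects a DataFrame with the training
# columns, not the raw list of JSON instances that online prediction receives.
# df_new is an assumed DataFrame with the same columns as X_train.
# preds = final_pipeline.predict(df_new)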

This seems like a very basic requirement for ML in the cloud. Any pointers?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
dizcology commented, Apr 25, 2019

@rafiqhasan

It’s been a while, and I am not sure whether you have already found a way around the potential issues of running pandas functions in a sklearn pipeline at prediction time. AI Platform Prediction now offers more flexible ways of deploying sklearn pipelines for prediction: https://cloud.google.com/ml-engine/docs/scikit/custom-prediction-routines
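
As a rough sketch only (not an official sample), a custom prediction routine along the lines of the linked docs could do the JSON-to-DataFrame conversion before delegating to the pipeline. The from_path/predict interface follows that documentation; the class name, column handling, and model file name are assumptions.

import os
import joblib
import pandas as pd

class PipelinePredictor(object):
    """Hypothetical custom prediction routine: rebuilds a DataFrame from the
    decoded JSON instances before calling the sklearn pipeline."""

    def __init__(self, pipeline):
        self._pipeline = pipeline

    def predict(self, instances, **kwargs):
        # 'instances' arrives as a list of dicts decoded from the JSON payload;
        # the pipeline's transformers select columns by name, so wrap it first.
        df = pd.DataFrame(instances)
        return self._pipeline.predict(df).tolist()

    @classmethod
    def from_path(cls, model_dir):
        # Assumes the trained pipeline was exported as model.joblib (see above).
        pipeline = joblib.load(os.path.join(model_dir, 'model.joblib'))
        return cls(pipeline)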

0 reactions
andrewferlitsch commented, Mar 16, 2020

There has been no response from the submitter for half a year.

Read more comments on GitHub >

Top Results From Across the Web

  • Getting online predictions with scikit-learn - AI Platform
    The Pipeline module in scikit-learn enables you to apply multiple data transformations before training with an estimator. This encapsulates multiple steps in ...
  • Online predictions API using scikit-learn and Cloud Machine ...
    This post will explain steps to train a model, store classifier on Google Cloud Storage and use Cloud Machine Learning to create an...
  • Scikit-learn Model Serving with Online Prediction Using Cloud ...
    You can now upload a model you've already trained onto Google Cloud Storage and use ML Engine's online prediction service to support ...
  • Deploying Machine Learning Models on Google Cloud ...
    Train on Kaggle; deploy on Google Cloud ... The deployment of a machine learning (ML) model to production starts with actually building the...
  • Vertex AI: Custom training job and prediction using managed ...
    It assumes that you are familiar with Machine Learning even though the machine learning code for training is provided to you. You will...
