
How to bin new samples with splits generated from training?


Hi,

First of all, congrats on the library!

I’m trying to reuse the bins generated during training to bin data in production. Here is my example:

from optbinning import OptimalBinning

x = df["var1"].to_numpy()
y = df["target"].to_numpy()

optb = OptimalBinning(name='var1')
optb.fit(x, y)
optb.splits

Then I get an array with the splits (for a numerical variable). That’s great, but if I want to categorize ‘var1’ for one row of new data (not part of training), what’s the best way to do it?

One way to do it is with pandas.cut(), but I don’t know if there is a better way to do it using optbinning.

pd.cut(df_new["var1"],  bins=optb.splits)
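A note on the pd.cut() idea above: when pd.cut() is given only the interior split points, values below the first split or above the last one come back as NaN, and by default its intervals are closed on the right. The following is a minimal, library-independent sketch of assigning a bin index from a sorted list of splits, assuming left-closed bins [a, b). The split values and the helper name `bin_index` are hypothetical and not part of optbinning.

```python
from bisect import bisect_right

# Hypothetical split points; in practice these would come from optb.splits.
splits = [10.0, 20.0, 30.0]

def bin_index(value, splits):
    """Return the 0-based bin index for left-closed bins [a, b).

    Bin 0 is (-inf, splits[0]) and the last bin is [splits[-1], inf),
    so every finite value receives an index.
    """
    return bisect_right(splits, value)

new_values = [5.0, 10.0, 25.0, 99.0]
indices = [bin_index(v, splits) for v in new_values]
print(indices)
```

A value equal to a split point lands in the bin that starts at that split, which is an assumption about the bin boundaries rather than something confirmed by the library here.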

Also, how can I deal with categorical variables, where some bins are several categories merged into an array (e.g. ['cat1', 'cat3', 'catn'])?

Best regards,
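For the categorical case, one possible approach is to invert the merged groups into a lookup table. This is only a sketch in plain Python: the category groups, the helper name `bin_category`, and the `-1` code for unseen categories are all hypothetical choices, not optbinning behaviour.

```python
# Hypothetical category groups, shaped like the merged bins described above
# (each bin is a list of categories).
category_bins = [["cat1", "cat3", "catn"], ["cat2"], ["cat4", "cat5"]]

# Invert the groups into a category -> bin index lookup.
category_to_bin = {
    category: index
    for index, group in enumerate(category_bins)
    for category in group
}

def bin_category(value, lookup, unknown=-1):
    """Return the bin index for a category, or `unknown` if it was never seen."""
    return lookup.get(value, unknown)

new_values = ["cat3", "cat2", "cat9"]
indices = [bin_category(v, category_to_bin) for v in new_values]
print(indices)
```

How categories unseen during training should be handled is a modelling decision; the sentinel value here just makes the gap visible.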

Gabriel

Issue Analytics

State:
Created 3 years ago
Comments:9 (9 by maintainers)

Top GitHub Comments

1 reaction

GabrielSGoncalves commented, Apr 16, 2020

Hi Guilhermo,

Getting the indices helped a lot, especially in the later steps of doing the one-hot encoding and organizing the feature categories.

I think this method would be complete if it also offered the possibility to get the bin names. For example:

import pandas as pd
from optbinning import OptimalBinning
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
variable = "mean radius"
x = df[variable].values
y = data.target
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp")
optb.fit(x, y)

Using the transform method I would get:

optb.transform(x, metric="indices")[:20]
>> array([6, 6, 6, 0, 6, 2, 6, 4, 2, 2, 5, 5, 6, 5, 4, 4, 4, 5, 6, 3])

What would be extremely handy:

optb.transform(x, metric="bins")[:20]
>> ['[16.93, inf)',
 '[16.93, inf)',
 '[16.93, inf)',
 '[-inf, 11.43)',
 '[16.93, inf)',
 '[12.33, 13.09)',
 '[16.93, inf)',
 '[13.70, 15.05)',
 '[12.33, 13.09)',
 '[12.33, 13.09)',
 '[15.05, 16.93)',
 '[15.05, 16.93)',
 '[16.93, inf)',
 '[15.05, 16.93)',
 '[13.70, 15.05)',
 '[13.70, 15.05)',
 '[13.70, 15.05)',
 '[15.05, 16.93)',
 '[16.93, inf)',
 '[13.09, 13.70)']

The way I’m getting this information is by using a list comprehension:

# Build the table once and avoid shadowing x inside the comprehension
table = optb.binning_table.build()
[table.at[i, "Bin"] for i in optb.transform(x, metric="indices")[:20]]
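The same index-to-label lookup can be sketched without the binning table by building the interval labels directly from the split points. The splits below are read off the rounded bin labels quoted above; the helper name `interval_labels` is hypothetical, and the sketch assumes only regular numerical bins (no special or missing-value bins).

```python
# Split points read off the (rounded) bin labels quoted above; with a fitted
# object they would presumably come from optb.splits instead.
splits = [11.43, 12.33, 13.09, 13.70, 15.05, 16.93]

def interval_labels(splits):
    """Return one '[low, high)' label per bin, ordered by bin index."""
    edges = ["-inf"] + [f"{s:.2f}" for s in splits] + ["inf"]
    return [f"[{low}, {high})" for low, high in zip(edges, edges[1:])]

labels = interval_labels(splits)

# First five indices from the transform output shown above.
indices = [6, 6, 6, 0, 6]
bins = [labels[i] for i in indices]
print(bins)
```

Building the label list once and indexing into it avoids rebuilding the table for every sample.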

I’m working on the design for a credit scoring class and I’m planning to share it with you by the end of the day.

Best regards,

Gabriel

1 reaction

GabrielSGoncalves commented, Apr 15, 2020

Hi Guilhermo,

Thanks for the fast answer!

The transform method with metric="indices" is exactly what I need. It would be extremely handy for creating scorecards and for use in production.

Would you be willing to implement it? If you need a hand with this task, I’ll be glad to help.

I really appreciate your attention,

Gabriel
