How to bin new samples with splits generated from training?

Hi,

First of all, congrats on the library!

I’m trying to take the bins generated during training and use them to bin data in production. Here is my example:

from optbinning import OptimalBinning

x = df["var1"].to_numpy()
y = df["target"].to_numpy()

optb = OptimalBinning(name='var1')
optb.fit(x, y)
optb.splits

Then I get an array with the splits (for a numerical variable). That’s great, but if I want to categorize ‘var1’ for one row of new data (not part of training), what’s the best way to do it?

One way to do it is with pandas.cut(), but I don’t know if there is a better way to do it using optbinning.

pd.cut(df_new["var1"],  bins=optb.splits)
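A note on the pd.cut() approach: passing only the split points as bin edges leaves any value below the first split or above the last one as NaN. Below is a minimal sketch of one way around that, assuming the splits are interior cut points and that bins are left-closed, as the "[a, b)" labels elsewhere in this thread suggest. The split values and new samples are made up for illustration and stand in for optb.splits and df_new["var1"].

```python
import numpy as np
import pandas as pd

# Hypothetical split points standing in for optb.splits
splits = np.array([11.43, 12.33, 13.09])

# Hypothetical new samples standing in for df_new["var1"]
new_values = pd.Series([5.0, 11.43, 12.5, 40.0], name="var1")

# Using the splits alone: values outside the outer splits become NaN
unpadded = pd.cut(new_values, bins=splits)

# Pad with -inf/+inf so every value falls into some bin
edges = np.concatenate(([-np.inf], splits, [np.inf]))

# right=False produces left-closed intervals [a, b)
binned = pd.cut(new_values, bins=edges, right=False)

# Integer bin index per row, 0 for the lowest bin
bin_index = binned.cat.codes.to_numpy()
```

Here `bin_index` plays the same role as the integer codes discussed later in the thread; missing values would still need separate handling.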

Also, how can I deal with categorical variables, where some bins consist of several categories merged into an array (e.g. ['cat1', 'cat3', 'catn'])?
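For the categorical case, one possible manual approach is to invert the list of merged categories into a lookup table. This is only a sketch in plain Python: `category_bins`, `bin_category` and the `-1` code for unseen categories are hypothetical names and choices made for illustration, not part of optbinning.

```python
# Hypothetical merged-category bins, in the shape described in the question
category_bins = [["cat1", "cat3", "catn"], ["cat2"], ["cat4", "cat5"]]

# Invert the groups into a category -> bin index lookup
category_to_bin = {
    category: index
    for index, group in enumerate(category_bins)
    for category in group
}


def bin_category(value, unseen=-1):
    """Return the bin index of a category, or `unseen` if it was not in training."""
    return category_to_bin.get(value, unseen)


# "cat9" never appeared in training, so it gets the `unseen` code
new_values = ["cat3", "cat2", "cat9"]
bin_indices = [bin_category(value) for value in new_values]
```

How to treat categories that were not present in training (a dedicated bin, a missing-value bin, or an error) is a modelling decision that the lookup leaves open.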

Regards,

Gabriel

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

GabrielSGoncalves commented, Apr 16, 2020

Hi Guilhermo,

Getting the indices helped a lot, especially in the later steps of doing the one-hot encoding and organizing the feature categories.

I think this method would be complete if it also offered the possibility to get the bin names. For example:

import pandas as pd
from optbinning import OptimalBinning
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
variable = "mean radius"
x = df[variable].values
y = data.target
optb = OptimalBinning(name=variable, dtype="numerical", solver="cp")
optb.fit(x, y)

Using the transform method I would get:

optb.transform(x, metric="indices")[:20]
>> array([6, 6, 6, 0, 6, 2, 6, 4, 2, 2, 5, 5, 6, 5, 4, 4, 4, 5, 6, 3])

What would be extremely handy:

optb.transform(x, metric="bins")[:20]
>> ['[16.93, inf)',
 '[16.93, inf)',
 '[16.93, inf)',
 '[-inf, 11.43)',
 '[16.93, inf)',
 '[12.33, 13.09)',
 '[16.93, inf)',
 '[13.70, 15.05)',
 '[12.33, 13.09)',
 '[12.33, 13.09)',
 '[15.05, 16.93)',
 '[15.05, 16.93)',
 '[16.93, inf)',
 '[15.05, 16.93)',
 '[13.70, 15.05)',
 '[13.70, 15.05)',
 '[13.70, 15.05)',
 '[15.05, 16.93)',
 '[16.93, inf)',
 '[13.09, 13.70)']

The way I’m getting this information is by using a list comprehension:

# Build the binning table once, then look up the "Bin" label for each index
bin_labels = optb.binning_table.build()["Bin"]
[bin_labels.at[i] for i in optb.transform(x, metric="indices")[:20]]
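The same lookup can also be sketched without the binning table, by building the labels directly from the split points and indexing them with NumPy. The split values below are the rounded numbers read off the labels shown above, and the "[a, b)" label format is copied from that output. The sample values are made up, and np.digitize is used here as a stand-in for transform(x, metric="indices") on the assumption that bins are left-closed.

```python
import numpy as np

# Rounded split points read off the bin labels shown in this thread
splits = np.array([11.43, 12.33, 13.09, 13.70, 15.05, 16.93])

# Add open-ended outer edges and format one "[lo, hi)" label per bin
edges = np.concatenate(([-np.inf], splits, [np.inf]))
labels = np.array(
    [f"[{lo:.2f}, {hi:.2f})" for lo, hi in zip(edges[:-1], edges[1:])]
)

# Hypothetical new samples; digitize returns i with splits[i-1] <= v < splits[i]
values = np.array([17.99, 10.0, 12.5])
indices = np.digitize(values, splits, right=False)

# Fancy indexing maps every index to its label in one step
bin_names = labels[indices]
```

Because the labels are computed once and indexed as an array, the cost per sample stays small even for large batches.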

I’m working on the design for a credit scoring class and I’m planning to share it with you by the end of the day.

Regards,

Gabriel

GabrielSGoncalves commented, Apr 15, 2020

Hi Guilhermo,

Thanks for the fast answer!

The transform method with metric="indices" is exactly what I need. It would be extremely handy for creating scorecards and for use in production.

Would you be willing to implement it? If you need a hand for this task I’ll be glad to help.

I really appreciate your attention,

Gabriel
