How to bin new samples with splits generated from training?
See original GitHub issueHi,
First of all, congrats for the library!
I’m trying to use the bins generated and use it to bin data in production. Here is my example:
from optbinning import OptimalBinning
x = df["var1"].to_numpy()
y = df["target"].to_numpy()
optb = OptimalBinning(name='var1')
optb.fit(x, y)
optb.splits
Then I get an array with the splits (for a numerical variable). That’s great, but if I want to categorize ‘var1’ for one row of new data (not part of training), what’s the best way to do it?
One way to do it using pandas.cut()
, but I don’t know if there is a better way to do it using optbinning
.
pd.cut(df_new["var1"], bins=optb.splits)
Also, how can I deal with categorical variables, as some bins are more than one category merged into an array (eg: ['cat1', 'cat3', 'catn']
).
Att
Gabriel
Issue Analytics
- State:
- Created 3 years ago
- Comments:9 (9 by maintainers)
Top Results From Across the Web
Binning for Feature Engineering in Machine Learning
Cut will split our column up using the label names and ranges we provide. Note, for 10 to be included we'll have to...
Read more >How to split data into training/testing sets using sample function
Use base R. Function runif generates uniformly distributed values from 0 to 1.By varying cutoff value ( ...
Read more >Simple Training/Test Set Splitting — initial_split • rsample
initial_split creates a single binary split of the data into a training set and testing set. initial_time_split does the same, but takes the...
Read more >Split the Dataset into the Training & Test Set in R
Method 1: Using base R · vec – A vector or matrix of elements from where to choose the sample. · size –...
Read more >Tutorial: optimal binning with binary target
Bin : the intervals delimited by the optimal split points. ... For this example, let's load data from the FICO Explainable Machine Learning...
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi Guilhermo,
Getting the indices helped a lot, specially on the posterior steps for doing the one-hot encoding and organizing the features categories.
I think this method would be complete if it also offered the possibility to get the bin names. For example:
Using the
transform
method I would get:What would be extremely handy:
The way I’m getting this information is by using a list comprehension:
I’m working on the design for a credit scoring class and I’m planning to share with you by the end of the day.
Att
Gabriel
Hi Guilhermo,
Thanks for the fast answer!
The
transform
method withmetrics='indices'
is exactly what I need. It would be extremely handy for creating ScoreCards and to use in production.Would you be willing to implement it? If you need a hand for this task I’ll be glad to help.
I really appreciate your attention,
Gabriel