Getting “nan” with cross_val_score and StackingClassifier or VotingClassifier

I want to use StackingClassifier and VotingClassifier with StratifiedKFold and cross_val_score. cross_val_score returns nan for every fold when the estimator is a StackingClassifier or a VotingClassifier; with any other estimator it works fine. I am using Python 3.8.5 and scikit-learn 0.23.2.

The Jupyter notebook (StackingVotingClassifierIssue.ipynb) and the dataset (parkinsons.csv) are attached. I moved the status column, which is the target, to the rightmost position in parkinsons.csv. The dataset can also be found on Kaggle. Below is the full code and the output.

StackingVotingClassifierIssue.zip

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import preprocessing
from sklearn import metrics
from sklearn import model_selection
from sklearn import feature_selection

from imblearn.over_sampling import SMOTE

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import StackingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier

import warnings
warnings.filterwarnings('ignore')

dataset = pd.read_csv('parkinsons.csv')


FS_X=dataset.iloc[:,:-1]
FS_y=dataset.iloc[:,-1:]

FS_X.drop(['name'],axis=1,inplace=True)

select_k_best = feature_selection.SelectKBest(score_func=feature_selection.f_classif,k=15)
X_k_best = select_k_best.fit_transform(FS_X,FS_y)

supportList = select_k_best.get_support().tolist()
p_valuesList = select_k_best.pvalues_.tolist()

toDrop=[]

for i in np.arange(len(FS_X.columns)):
    bool = supportList[i]
    if(bool == False):
        toDrop.append(FS_X.columns[i])     

FS_X.drop(toDrop,axis=1,inplace=True)        

smote = SMOTE(random_state=7)
Balanced_X,Balanced_y = smote.fit_sample(FS_X,FS_y)
before = pd.merge(FS_X,FS_y,right_index=True, left_index=True)
after = pd.merge(Balanced_X,Balanced_y,right_index=True, left_index=True)
b=before['status'].value_counts()
a=after['status'].value_counts()
print('Before')
print(b)
print('After')
print(a)

SkFold = model_selection.StratifiedKFold(n_splits=10, random_state=7, shuffle=False)

estimators_list = list()

KNN = KNeighborsClassifier()
RF = RandomForestClassifier(criterion='entropy',random_state = 1)
DT = DecisionTreeClassifier(criterion='entropy',random_state = 1)
GNB = GaussianNB()
LR = LogisticRegression(random_state = 1)

estimators_list.append(LR)
estimators_list.append(RF)
estimators_list.append(DT)
estimators_list.append(GNB)

SCLF = StackingClassifier(estimators = estimators_list,final_estimator = KNN,stack_method = 'predict_proba',cv=SkFold,n_jobs = -1)
VCLF = VotingClassifier(estimators = estimators_list,voting = 'soft',n_jobs = -1)

scores1 = model_selection.cross_val_score(estimator = SCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('StackingClassifier Scores',scores1)

scores2 = model_selection.cross_val_score(estimator = VCLF,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('VotingClassifier Scores',scores2)

scores3 = model_selection.cross_val_score(estimator = DT,X=Balanced_X.values,y=Balanced_y.values,scoring='accuracy',cv=SkFold)
print('DecisionTreeClassifier Scores',scores3)

Output

Before
1    147
0     48
Name: status, dtype: int64
After
1    147
0    147
Name: status, dtype: int64
StackingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
VotingClassifier Scores [nan nan nan nan nan nan nan nan nan nan]
DecisionTreeClassifier Scores [0.86666667 0.9        0.93333333 0.86666667 0.96551724 0.82758621
 0.75862069 0.86206897 0.86206897 0.93103448]

Issue Analytics

State:
Created 3 years ago
Comments: 6 (4 by maintainers)

Top GitHub Comments

4 reactions

glemaitre commented, Dec 2, 2020

This is because an error is raised internally during your cross_val_score call. By default we are permissive and replace the score with nan. To get the traceback, you need to pass error_score="raise"; in your case I am getting:

ValueError: Setting a random_state has no effect since shuffle is False. You should leave random_state to its default (None), or set shuffle=True.
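
The two points above can be sketched as follows. This is only an illustration under stated assumptions: a synthetic dataset from make_classification stands in for parkinsons.csv, the set of base estimators is reduced, and the fold count is lowered to keep it quick. It also passes `estimators` as (name, estimator) pairs, which is the form the scikit-learn documentation describes; the names are arbitrary labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic binary data in place of the Parkinson's dataset.
X, y = make_classification(n_samples=200, n_features=15, random_state=7)

# random_state only has an effect when shuffle=True, so either set both
# (as here) or leave both at their defaults.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# (name, estimator) pairs; the names are illustrative only.
estimators = [
    ("lr", LogisticRegression(max_iter=1000, random_state=1)),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=1)),
    ("gnb", GaussianNB()),
]

stack = StackingClassifier(
    estimators=estimators,
    final_estimator=KNeighborsClassifier(),
    stack_method="predict_proba",
    cv=cv,
)
vote = VotingClassifier(estimators=estimators, voting="soft")

# error_score="raise" re-raises an exception from fitting instead of
# recording nan for that fold, so the real traceback becomes visible.
stack_scores = cross_val_score(
    stack, X, y, scoring="accuracy", cv=cv, error_score="raise"
)
vote_scores = cross_val_score(
    vote, X, y, scoring="accuracy", cv=cv, error_score="raise"
)

print("StackingClassifier scores:", stack_scores)
print("VotingClassifier scores:", vote_scores)
```

Once the underlying error is fixed, error_score can be returned to its default; keeping "raise" during debugging just makes failures loud rather than silent.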

0 reactions

tusharkhoche commented, Dec 2, 2020

Hi @glemaitre, thank you very much for the response. Got the issue. It worked. I feel like a fool. 😃 Regarding the SMOTE point, this was just a minimal working example so that anyone who looks at this issue only needs to download and run the code. But thank you for this.
