Infinite loop bug in GridSearchCV with svm.SVC() on Windows 10
This report is going to be a bit short, as I cannot determine with 100% accuracy which steps cause this bug.
Setup:
- Windows 10, most recent update
- Anaconda, most recent update
- Jupyter Notebook and prompt, latest update (issue tested in both; no warnings, errors, or other messages are printed in either)
- All packages (scikit-learn, numpy, etc.), latest update
- AMD FX 8350 (8 cores), Nvidia GeForce GTX 980, 16 GB RAM
The issue: when running with n_jobs set to -1, my grid_search_wrapper runs fine with MLPClassifier() and takes up ~70% of CPU processing power. The jobs (192 candidate parameter sets x 10-fold cross-validation = 1920 fits) run in about 8 minutes and return the expected dataframe of results.
When running with clf set to an SVM (svm.SVC()), the process always starts up and prints:
Fitting X folds for each of Y candidates, totalling (sic) X*Y fits
After this, my computer sits for hours without any progress. Killing the kernel does not halt the ~10-15 spawned processes. When n_jobs is set to -1, killing Python through Task Manager ends the CPU usage. When n_jobs = 1, CPU usage is only ~20% (I believe only one core is being utilized), but no Python processes appear in Task Manager, so I have to restart my computer to stop the single-core calculation.
Note that training individual models without passing them through the grid search function succeeds. I have not tested every combination by hand, but I have tested each kernel individually; training a single SVM model took 1-2 minutes on average.
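For reference, an individual fit looks roughly like this (it uses the same x_train/x_test/y_train/y_test split as in the code dump below; the kernel, degree, and tol values here are just one example, not the full set I tested):

# One-off fit outside GridSearchCV; this is the kind of run that finishes in ~1-2 minutes.
# The parameter values are only an example.
from sklearn import svm

single_clf = svm.SVC(kernel='poly', degree=3, tol=1e-3)
single_clf.fit(x_train, y_train)
print(single_clf.score(x_test, y_test))  # plain accuracy on the held-out set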
Here are the variations of inputs and the resulting output of the grid_search_wrapper function:
with n_jobs = -1:
ml_params = {
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'degree': [2,3,4],
'tol': [1e-3, 1e-4, 1e-2]
}
FAIL
ml_params = {
'kernel': ['linear', 'rbf', 'sigmoid'],
}
PASS
ml_params = {
'kernel': ['linear', 'rbf', 'sigmoid'],
'tol': [1e-3, 1e-4, 1e-2]
}
FAIL
ml_params = {
'kernel': ['linear']
}
PASS
ml_params = {
'kernel': ['rbf']
}
PASS
ml_params = {
'kernel': ['sigmoid']
}
PASS
ml_params = {
'kernel': ['poly']
}
FAIL
with n_jobs = 1:
ml_params = {
'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
'degree': [2,3,4],
'tol': [1e-3, 1e-4, 1e-2]
}
FAIL
ml_params = {
'kernel': ['linear', 'rbf', 'sigmoid'],
}
FAIL
ml_params = {
'kernel': ['linear', 'rbf', 'sigmoid'],
'tol': [1e-3, 1e-4, 1e-2]
}
FAIL
ml_params = {
'kernel': ['linear']
}
PASS
ml_params = {
'kernel': ['rbf']
}
PASS
ml_params = {
'kernel': ['sigmoid']
}
PASS
ml_params = {
'kernel': ['poly']
}
FAIL
Note that k_folds was set to 3 (instead of the 10 used when training MLPClassifier) to make it quicker to figure out what was happening with the SVM. I think this setting is irrelevant to the problem.
Data set: 15,000 instances x 90 predictors (relatively small; memory usage is about 2 GB during SVM runs).
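In case it is useful, here is a stripped-down sketch of a failing case. make_classification is only a stand-in for my real 15,000 x 90 data set, so I cannot promise it hangs on other machines, but it mirrors the setup that hangs for me:

# Minimal repro sketch: synthetic data standing in for my real data set.
# On my machine the poly kernel (or any grid that searches over tol) hangs at fit().
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=15000, n_features=90, random_state=0)

param_grid = {'kernel': ['poly'], 'degree': [2, 3, 4]}
gs = GridSearchCV(svm.SVC(), param_grid,
                  cv=StratifiedKFold(n_splits=3),
                  n_jobs=1, verbose=1)
gs.fit(X, y)  # prints "Fitting 3 folds ..." and then never finishes for me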
Code dump:
def grid_search_wrapper(clf, param_grid, scoring, X_train, X_test, y_train, y_test, refit_score='accuracy_score'):
    # https://towardsdatascience.com/fine-tuning-a-classifier-in-scikit-learn-66e048c21e65
    """
    Fits a GridSearchCV classifier, refitting on refit_score,
    and prints classifier performance metrics on the test data.
    """
    import pandas as pd
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import GridSearchCV
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=3)
    # use the scoring dict passed in, rather than the global scorers variable
    grid_search = GridSearchCV(clf, param_grid, cv=skf, scoring=scoring,
                               refit=refit_score, return_train_score=True,
                               n_jobs=1, verbose=1)
    grid_search.fit(X_train, y_train)

    # make the predictions
    y_pred = grid_search.predict(X_test)

    print('Best params for {}'.format(refit_score))
    print(grid_search.best_params_)

    # confusion matrix on the test data
    print('\nConfusion matrix of model optimized for {} on the test data:'.format(refit_score))
    print(pd.DataFrame(confusion_matrix(y_test, y_pred),
                       columns=['pred_neg', 'pred_pos'], index=['neg', 'pos']))
    return grid_search
#ignore my messy code and the imports not being at the top
import pandas as pd
from sklearn import svm
from sklearn.metrics import roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
#gridsearch for optimal MLPs
#ml_params = {
# 'activation': ['relu', 'tanh', 'logistic'],
# 'alpha': [1e-3, 1e-4, 1e-5, 1e-6],
# 'hidden_layer_sizes': [[100,25,], [50,50,], [75,25,25], [50,25,10]],
# 'max_iter': [100, 500, 1000, 2500]
#}
ml_params = {
    # 'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    # the poly kernel is problematic
    'kernel': ['linear', 'rbf', 'sigmoid'],
    # 'degree': [2, 3, 4],
    'tol': [1e-3, 1e-4, 1e-2]
    # defaults: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    #     decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    #     max_iter=-1, probability=False, random_state=None, shrinking=True,
    #     tol=0.001, verbose=False)
}
scorers = {
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}
grid_search_clf = grid_search_wrapper(clf=svm.SVC(), param_grid=ml_params, scoring=scorers,
                                      X_train=x_train, X_test=x_test,
                                      y_train=y_train, y_test=y_test,
                                      refit_score='recall_score')
results = pd.DataFrame(grid_search_clf.cv_results_)
results = results.sort_values(by='mean_test_recall_score', ascending=False)
#for MLP
#results[['mean_test_precision_score', 'mean_test_accuracy_score', 'mean_test_recall_score', 'param_activation', 'param_alpha', 'param_hidden_layer_sizes', 'param_max_iter']]
#for svm
results[['mean_test_precision_score', 'mean_test_accuracy_score', 'mean_test_recall_score', 'param_kernel', 'param_tol']]
Top GitHub Comments
No, the issue is probably the infinite max_iter.
Searching over tol is an unusual thing to do. You should try setting a finite max_iter.
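For illustration, a capped grid would look something like this; the max_iter values are only examples, not a maintainer recommendation (when the cap is hit, scikit-learn should emit a ConvergenceWarning instead of iterating indefinitely):

# Same grid as before, but with a finite max_iter so libsvm cannot iterate forever.
# The specific caps below are illustrative values, not tuned recommendations.
ml_params = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'degree': [2, 3, 4],
    'max_iter': [10000, 100000],  # default is -1, i.e. no limit on iterations
}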