Parallelism in GridSearchCV is ending up with a permission error
Description - Parallelism (n_jobs=-1) in GridSearchCV stops with a permission error.
Steps/Code to Reproduce -
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit , GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.utils import parallel_backend
# Standardization of data
X_Train_Vectors_Std = StandardScaler(with_mean=False).fit_transform(X_Train_Vectors)
X_test_Vectors_Std = StandardScaler(with_mean=False).fit_transform(X_test_Vectors)
# List of lambda values to search over
lambdaList = [10**-4, 10**-2, 10**0, 10**2, 10**4]
time_split = TimeSeriesSplit(n_splits=5)
param_search = dict(C=lambdaList)
grid = GridSearchCV(
    estimator=LogisticRegression(solver='saga'),
    param_grid=param_search,
    n_jobs=-1,
    scoring='f1_weighted',
    cv=time_split.split(X_Train_Vectors_Std),
    return_train_score=True,
)
grid.fit(X_Train_Vectors_Std, Y_Train)
Expected Results : No error is expected
Actual Results
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\disk.py:122: UserWarning: Unable to delete folder C:\Users\HANI\AppData\Local\Temp\joblib_memmapping_folder_13296_3875384810 after 5 tentatives.
.format(folder_path, RM_SUBDIRS_N_RETRY))
---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
<ipython-input-6-c065dfe04993> in <module>()
9 grid = GridSearchCV(estimator = LogisticRegression(solver='saga'), param_grid = param_search,n_jobs = -1, scoring = 'f1_weighted', cv=time_split.split(X_Train_Vectors_Std)
10 ,return_train_score = True )
---> 11 grid.fit(X_Train_Vectors_Std,Y_Train)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py in fit(self, X, y, groups, **fit_params)
720 return results_container[0]
721
--> 722 self._run_search(evaluate_candidates)
723
724 results = results_container[0]
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __exit__(self, exc_type, exc_value, traceback)
730
731 def __exit__(self, exc_type, exc_value, traceback):
--> 732 self._terminate_backend()
733 self._managed_backend = False
734
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _terminate_backend(self)
760 def _terminate_backend(self):
761 if self._backend is not None:
--> 762 self._backend.terminate()
763
764 def _dispatch(self, batch):
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in terminate(self)
524 # in latter calls but we free as much memory as we can by deleting
525 # the shared memory
--> 526 delete_folder(self._workers._temp_folder)
527 self._workers = None
528
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\joblib\disk.py in delete_folder(folder_path, onerror)
113 while True:
114 try:
--> 115 shutil.rmtree(folder_path, False, None)
116 break
117 except (OSError, WindowsError):
C:\ProgramData\Anaconda3\lib\shutil.py in rmtree(path, ignore_errors, onerror)
492 os.close(fd)
493 else:
--> 494 return _rmtree_unsafe(path, onerror)
495
496 # Allow introspection of whether or not the hardening against symlink
C:\ProgramData\Anaconda3\lib\shutil.py in _rmtree_unsafe(path, onerror)
387 os.unlink(fullname)
388 except OSError:
--> 389 onerror(os.unlink, fullname, sys.exc_info())
390 try:
391 os.rmdir(path)
C:\ProgramData\Anaconda3\lib\shutil.py in _rmtree_unsafe(path, onerror)
385 else:
386 try:
--> 387 os.unlink(fullname)
388 except OSError:
389 onerror(os.unlink, fullname, sys.exc_info())
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\HANI\\AppData\\Local\\Temp\\joblib_memmapping_folder_13296_3875384810\\13296-2443532547352-7b8cd102e07c472ab00885ea9ca3e72d.pkl'
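The traceback shows the folder cleanup being retried a few times ("after 5 tentatives") before the PermissionError surfaces. As a rough illustration of that kind of retry loop, here is a minimal sketch. It is not joblib's actual code: the helper name delete_folder_with_retry and its n_retry and delay parameters are hypothetical, chosen only for this example.

```python
import os
import shutil
import tempfile
import time
import warnings


def delete_folder_with_retry(folder_path, n_retry=5, delay=0.1):
    """Hypothetical helper: remove folder_path, retrying on OSError.

    Returns True if the folder is gone afterwards, False if every
    attempt failed (for example because a file inside is still open
    in another process on Windows).
    """
    for attempt in range(1, n_retry + 1):
        try:
            shutil.rmtree(folder_path)
            return True
        except FileNotFoundError:
            # Nothing left to delete.
            return True
        except OSError:
            if attempt == n_retry:
                warnings.warn(
                    "Unable to delete folder {} after {} attempts.".format(
                        folder_path, n_retry
                    )
                )
                return False
            # Give other processes a moment to release their handles.
            time.sleep(delay)
    return False
```

A retry loop like this only helps if the other process releases the file in time. If a worker keeps the memory-mapped .pkl file open, every attempt fails and the error is only delayed.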
Versions
Windows-10-10.0.17134-SP0
Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
NumPy 1.15.2
SciPy 1.1.0
Scikit-Learn 0.20.0
Issue Analytics
- State:
- Created 5 years ago
- Reactions:3
- Comments:57 (38 by maintainers)
Top GitHub Comments
Interestingly, the reproduction snippet always fails (never at the first iteration of the for loop). Note the use of a pandas DataFrame for X_train. However, when X_train is a NumPy array, the same code does not fail.
I cannot reproduce this in a VM, possibly because memory-mapped files behave differently there.
I will try to reproduce with a CI worker in this PR: https://github.com/joblib/joblib/pull/942
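To check how a given machine (VM, CI worker, or desktop) treats deletion of a file that is still memory mapped, a small standard-library probe along these lines could be used. This is only a sketch under assumptions: the function name can_delete_while_mapped is made up for this example, and it uses Python's mmap module rather than joblib's own memmapping machinery, so it approximates the situation in the traceback instead of reproducing it.

```python
import mmap
import os
import tempfile


def can_delete_while_mapped(size=4096):
    """Hypothetical probe: try to unlink a file while it is memory mapped.

    Returns True if the unlink succeeds (typical on POSIX systems) and
    False if the OS refuses with PermissionError (the WinError 32
    behaviour seen in the traceback).
    """
    fd, path = tempfile.mkstemp(suffix=".pkl")
    try:
        os.write(fd, b"\0" * size)
        mapped = mmap.mmap(fd, size)
        try:
            try:
                os.unlink(path)
                deleted = True
            except PermissionError:
                deleted = False
        finally:
            # Release the mapping before closing the descriptor.
            mapped.close()
    finally:
        os.close(fd)
        # Clean up if the unlink was refused while the file was open.
        if os.path.exists(path):
            os.unlink(path)
    return deleted


if __name__ == "__main__":
    print("unlink while mapped succeeded:", can_delete_while_mapped())
```

Running it on each environment and comparing the results would show whether the platform itself refuses the delete, independently of scikit-learn and joblib.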