KeyError with joblib and sklearn cross_validate
This was originally reported as https://github.com/dask/distributed/issues/2532
I tried to reproduce it on joblib master with scikit-learn 0.20.2 as follows.
import os
os.environ['SKLEARN_SITE_JOBLIB'] = "1"
from dask.distributed import Client
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
import joblib
client = Client(processes=False)
joblib.parallel_backend('dask')
diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
model = linear_model.LinearRegression()
cv_results = cross_validate(model, X, y, cv=10, return_train_score=False,
                            verbose=100)
However, this seems to freeze without reporting the original error. Instead, when I interrupt with Ctrl-C, I get:
[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 4 concurrent workers.
[CV] ................................................................
[CV] .................................... , score=0.587, total= 0.0s
[Parallel(n_jobs=-1)]: Done 1 tasks | elapsed: 0.0s
^CTraceback (most recent call last):
File "/home/ogrisel/code/joblib/joblib/_dask.py", line 223, in maybe_to_futures
f = call_data_futures[arg]
File "/home/ogrisel/code/joblib/joblib/_dask.py", line 56, in __getitem__
ref, val = self._data[id(obj)]
KeyError: 140545475489024
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ogrisel/tmp/joblib_dask_freeze.py", line 18, in <module>
verbose=100)
File "/home/ogrisel/code/scikit-learn/sklearn/model_selection/_validation.py", line 231, in cross_validate
for train, test in cv.split(X, y, groups))
File "/home/ogrisel/code/joblib/joblib/parallel.py", line 924, in __call__
while self.dispatch_one_batch(iterator):
File "/home/ogrisel/code/joblib/joblib/parallel.py", line 759, in dispatch_one_batch
self._dispatch(tasks)
File "/home/ogrisel/code/joblib/joblib/parallel.py", line 716, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/ogrisel/code/joblib/joblib/_dask.py", line 254, in apply_async
func, args = self._to_func_args(func)
File "/home/ogrisel/code/joblib/joblib/_dask.py", line 243, in _to_func_args
args = list(maybe_to_futures(args))
File "/home/ogrisel/code/joblib/joblib/_dask.py", line 231, in maybe_to_futures
[f] = self.client.scatter([arg])
File "/home/ogrisel/code/distributed/distributed/client.py", line 1875, in scatter
asynchronous=asynchronous, hash=hash)
File "/home/ogrisel/code/distributed/distributed/client.py", line 676, in sync
return sync(self.loop, func, *args, **kwargs)
File "/home/ogrisel/code/distributed/distributed/utils.py", line 275, in sync
e.wait(10)
File "/opt/python3.7/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/opt/python3.7/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
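Reading the traceback, the KeyError looks like an ordinary cache miss: the lookup keyed on id(obj) fails, and the handler falls back to self.client.scatter, which is where the program was blocked when it was interrupted. The following is a minimal sketch of that lookup-then-fallback pattern. IdKeyedCache, get_or_scatter and fake_scatter are hypothetical names for illustration, not joblib's code, and the cache here holds a strong reference to each object as a simplification.

```python
class IdKeyedCache:
    """Hypothetical cache keyed by id(obj), loosely modeled on the traceback."""

    def __init__(self):
        self._data = {}

    def __setitem__(self, obj, value):
        # Store the object next to the value so its id cannot be recycled
        # while the entry is alive (simplifying assumption).
        self._data[id(obj)] = (obj, value)

    def __getitem__(self, obj):
        # Raises KeyError(id(obj)) on a miss, like the first traceback.
        ref, value = self._data[id(obj)]
        return value


def get_or_scatter(cache, arg, scatter):
    """Return the cached future for arg, scattering it on a cache miss."""
    try:
        return cache[arg]
    except KeyError:
        # In the report, the call made at this point never returned.
        future = scatter(arg)
        cache[arg] = future
        return future


calls = []


def fake_scatter(obj):
    # Stand-in for a real scatter call; only records that it was invoked.
    calls.append(obj)
    return "future-%d" % len(calls)


cache = IdKeyedCache()
data = [1, 2, 3]
first = get_or_scatter(cache, data, fake_scatter)
second = get_or_scatter(cache, data, fake_scatter)
```

Under this reading, the KeyError itself is harmless; the thing to investigate is why the scatter call blocks.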
Issue Analytics
- State:
- Created 5 years ago
- Comments:22 (19 by maintainers)
Top GitHub Comments
Thanks @pierreglaser; I’ll have a look at it ASAP (cc @samronsin).
Indeed this is what I thought. I wonder why we don’t get the same issue with the call to self.client.submit. Apparently it always works as a synchronous call, probably because it’s expected to always be fast and therefore would not really benefit from an asynchronous=True option.
To implement solution 2, we would need to refactor all the functions called under apply_async to work either as synchronous functions or as awaitable coroutines, using the self.client.sync trick already used in self.client.scatter. I will try to find the time to give it a try this afternoon or tomorrow. If anyone else wants to try, please feel free to do it as well though 😃
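As a rough illustration of the dual-mode idea discussed above, here is a minimal sketch of a method that returns a result when called synchronously and an awaitable when the owner is in asynchronous mode. It uses plain asyncio rather than distributed's event loop, and DualModeClient, _scatter and the "future:" strings are hypothetical stand-ins, not the distributed or joblib API.

```python
import asyncio
import threading


class DualModeClient:
    """Hypothetical client whose methods are sync or awaitable by mode."""

    def __init__(self, asynchronous=False):
        self.asynchronous = asynchronous
        self.loop = None
        self._thread = None
        if not asynchronous:
            # Synchronous mode: run a private event loop in a background thread.
            self.loop = asyncio.new_event_loop()
            self._thread = threading.Thread(
                target=self.loop.run_forever, daemon=True
            )
            self._thread.start()

    def sync(self, coro_func, *args, **kwargs):
        coro = coro_func(*args, **kwargs)
        if self.asynchronous:
            # Hand the coroutine back; the caller is expected to await it.
            return coro
        # Block the calling thread until the background loop finishes it.
        return asyncio.run_coroutine_threadsafe(coro, self.loop).result()

    async def _scatter(self, data):
        await asyncio.sleep(0)  # placeholder for real network I/O
        return ["future:%s" % item for item in data]

    def scatter(self, data):
        # One public method serves both calling styles.
        return self.sync(self._scatter, data)

    def close(self):
        if self._thread is not None:
            self.loop.call_soon_threadsafe(self.loop.stop)
            self._thread.join()
            self.loop.close()


blocking = DualModeClient(asynchronous=False)
sync_result = blocking.scatter(["a", "b"])
blocking.close()

awaitable = DualModeClient(asynchronous=True)
async_result = asyncio.run(awaitable.scatter(["c"]))
```

The point of the sketch is that every function on the call path has to go through the same sync helper, which is why the refactoring touches everything called under apply_async.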