KeyError with joblib and sklearn cross_validate

This was originally reported as https://github.com/dask/distributed/issues/2532

I tried to reproduce it on joblib master with scikit-learn 0.20.2 as follows.

import os
os.environ['SKLEARN_SITE_JOBLIB'] = "1"
from dask.distributed import Client
from sklearn import datasets, linear_model
from sklearn.model_selection import cross_validate
import joblib


client = Client(processes=False)
joblib.parallel_backend('dask')

diabetes = datasets.load_diabetes()
X = diabetes.data[:150]
y = diabetes.target[:150]
model = linear_model.LinearRegression()

cv_results = cross_validate(model, X, y, cv=10, return_train_score=False,
                            verbose=100)

However, this seems to freeze without reporting the original error. Instead, when I interrupt with Ctrl-C, I get:

[Parallel(n_jobs=-1)]: Using backend DaskDistributedBackend with 4 concurrent workers.
[CV]  ................................................................
[CV] .................................... , score=0.587, total=   0.0s
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    0.0s
^CTraceback (most recent call last):
  File "/home/ogrisel/code/joblib/joblib/_dask.py", line 223, in maybe_to_futures
    f = call_data_futures[arg]
  File "/home/ogrisel/code/joblib/joblib/_dask.py", line 56, in __getitem__
    ref, val = self._data[id(obj)]
KeyError: 140545475489024

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ogrisel/tmp/joblib_dask_freeze.py", line 18, in <module>
    verbose=100)
  File "/home/ogrisel/code/scikit-learn/sklearn/model_selection/_validation.py", line 231, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "/home/ogrisel/code/joblib/joblib/parallel.py", line 924, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/ogrisel/code/joblib/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/ogrisel/code/joblib/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/ogrisel/code/joblib/joblib/_dask.py", line 254, in apply_async
    func, args = self._to_func_args(func)
  File "/home/ogrisel/code/joblib/joblib/_dask.py", line 243, in _to_func_args
    args = list(maybe_to_futures(args))
  File "/home/ogrisel/code/joblib/joblib/_dask.py", line 231, in maybe_to_futures
    [f] = self.client.scatter([arg])
  File "/home/ogrisel/code/distributed/distributed/client.py", line 1875, in scatter
    asynchronous=asynchronous, hash=hash)
  File "/home/ogrisel/code/distributed/distributed/client.py", line 676, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/home/ogrisel/code/distributed/distributed/utils.py", line 275, in sync
    e.wait(10)
  File "/opt/python3.7/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/opt/python3.7/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
KeyboardInterrupt
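
Reading the two tracebacks together, the KeyError appears to be a lookup miss raised at line 223 of maybe_to_futures, and the KeyboardInterrupt arrived while the fallback call to self.client.scatter at line 231 was blocked. Python chains the two with "During handling of the above exception", which is why the KeyError shows up first even though the hang is in scatter. The sketch below reproduces only that chaining with the standard library; `IdKeyedCache`, `lookup_or_scatter` and `interrupted_scatter` are hypothetical stand-ins and not joblib's actual code.

```python
class IdKeyedCache:
    """Hypothetical stand-in for a mapping keyed by id(obj)."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, obj):
        # A miss raises KeyError whose argument is the integer id of obj.
        return self._data[id(obj)]


def lookup_or_scatter(cache, arg, scatter):
    try:
        return cache[arg]
    except KeyError:
        # Lookup miss: fall back to the slow path. Any exception raised
        # from here is implicitly chained to the KeyError being handled.
        return scatter(arg)


def interrupted_scatter(arg):
    # Simulates Ctrl-C arriving while the fallback call is blocked.
    raise KeyboardInterrupt


def demo():
    try:
        lookup_or_scatter(IdKeyedCache(), object(), interrupted_scatter)
    except KeyboardInterrupt as exc:
        return exc
```

Calling `demo()` returns the KeyboardInterrupt, and its `__context__` attribute holds the earlier KeyError. Under this reading the KeyError is ordinary control flow and the part worth investigating is the blocked scatter call.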

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 22 (19 by maintainers)

Top GitHub Comments

1 reaction
jjerphan commented, Jul 20, 2019

Thanks @pierreglaser; I’ll have a look at it ASAP (cc @samronsin).

1 reaction
ogrisel commented, Jul 18, 2019

Indeed, this is what I thought. I wonder why we don’t get the same issue with the call to self.client.submit. Apparently it always works as a synchronous call, probably because it is expected to always be fast and therefore would not really benefit from an asynchronous=True option.

To implement solution 2, we would need to refactor all the functions called under apply_async to work either as synchronous functions or as awaitable coroutines, using the self.client.sync trick already used in self.client.scatter. I will try to find the time to give it a try this afternoon or tomorrow. If anyone else wants to try, please feel free to do so as well 😃
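
As a rough illustration of the "synchronous function or awaitable coroutine" idea, here is a minimal sketch using only asyncio. `MiniClient` and its methods are hypothetical and much simpler than the real dask.distributed client; the assumption is only that a single `sync` helper decides whether to return the coroutine to an async caller or to run it to completion for a blocking caller.

```python
import asyncio


class MiniClient:
    """Hypothetical client; not the dask.distributed API."""

    def __init__(self, asynchronous=False):
        self.asynchronous = asynchronous

    def sync(self, coro_func, *args):
        coro = coro_func(*args)
        if self.asynchronous:
            # The caller runs inside an event loop: hand back the awaitable.
            return coro
        # Blocking caller: run the coroutine to completion and return its result.
        return asyncio.run(coro)

    async def _scatter(self, data):
        await asyncio.sleep(0)  # placeholder for network I/O
        return [f"future-{i}" for i, _ in enumerate(data)]

    def scatter(self, data):
        # The same public method serves both calling styles.
        return self.sync(self._scatter, data)


def blocking_use():
    return MiniClient().scatter(["a", "b"])


async def async_use():
    return await MiniClient(asynchronous=True).scatter(["a", "b"])
```

Every function on the call path would need the same treatment, which is why the refactoring described above touches everything called under apply_async.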
