parallel processes freezing when matrices are too big
NOTE: This appears to be a bug in the MKL libraries, not in joblib. Try using OpenBLAS and see if that helps.
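For anyone triaging a similar hang, one quick way to check which BLAS implementation NumPy is linked against (a sketch; the exact output format varies across NumPy versions):

```python
import numpy as np

# Print the BLAS/LAPACK build configuration NumPy was compiled against;
# look for "mkl" or "openblas" in the reported library names.
np.__config__.show()
```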
Recently my processes have started freezing when I run models with large-ish data, and I’m not sure why. Unfortunately, I can’t follow this back to a specific change that I might have made.
Basically, this works fine:

```python
from sklearn.cross_validation import cross_val_score
from numpy.random import randn
from sklearn.linear_model import Ridge

X = randn(815000, 100)
y = randn(815000, 1)
mod = Ridge()
sc = cross_val_score(mod, X, y, n_jobs=3)
```
while running the following code:

```python
from sklearn.cross_validation import cross_val_score
from numpy.random import randn
from sklearn.linear_model import Ridge

X = randn(815000, 300)
y = randn(815000, 1)
mod = Ridge()
sc = cross_val_score(mod, X, y, n_jobs=3)
```

results in the forked processes hanging.
What happens is that first a number of processes spawn off, and they churn away at the data for a while. This is `top` after a few seconds of running the above code:

[screenshot of `top` output not preserved]

However, after another 10 seconds or so, these processes have finished, and another set of processes is created that hangs:

[screenshot of `top` output not preserved]

You can see the processes that spawned off, and that none of them are chewing up any CPU time. It remains in this state indefinitely…
I thought this might be a problem with joblib trying to memmap things, but both matrices are well over the `max_nbytes` default for `Parallel` (at least, according to `X.nbytes`).
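To rule memmapping in or out, the threshold can be controlled explicitly through `Parallel`'s `max_nbytes` argument (a small sketch with a toy array rather than the original 815000-row matrix; `max_nbytes=None` disables memmapping entirely):

```python
from joblib import Parallel, delayed
import numpy as np

X = np.random.randn(1000, 100)  # toy stand-in for the large matrix

# max_nbytes is the size threshold above which joblib memmaps arrays
# passed to worker processes; None disables memmapping altogether.
results = Parallel(n_jobs=2, max_nbytes=None)(
    delayed(np.mean)(X[i::2]) for i in range(2)
)
```

If the hang disappears with memmapping disabled (or forced on with a tiny `max_nbytes`), that narrows down where the problem lives.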
Note that these matrices, and ones larger than them, have worked totally fine in the past for fitting these kinds of models. I’m not really sure what’s going on…
I’m using:

- sklearn version: 0.14.1
- joblib version: 0.8.0a3 (though this also breaks on 0.7.1)
- all packages linked against MKL (though I tried it after removing MKL in Anaconda, and it still hangs)
- a Unix machine (CentOS)
Note: I wasn’t sure whether this was a SKL problem or a joblib problem, so let me know if I should open this on the SKL repo instead. I just used cross_val_score because it was the simplest way to describe what happens, but this problem exists when I’m using my own parallelization code w/ joblib as well.
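For what it’s worth, a commonly suggested mitigation for MKL deadlocks in forked workers is to force single-threaded BLAS before NumPy is imported (a sketch; the environment variable only takes effect if it is set before MKL is loaded):

```python
import os

# Cap MKL's internal threading. Multithreaded MKL state can deadlock in
# processes forked by joblib/multiprocessing, so running one BLAS thread
# per worker is a common mitigation. Must be set before importing NumPy.
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # imported after the env var so MKL picks it up

X = np.random.randn(200, 50)
G = X.T @ X  # this BLAS call now runs single-threaded under MKL
```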
Issue Analytics
- Created 9 years ago
- Comments: 37 (14 by maintainers)
Thanks for the info.
+1, I am still encountering this issue with joblib 0.8.4 with Anaconda and MKL.