parallel processes freezing when matrices are too big

NOTE: It looks like this is a bug in the MKL libraries, not with joblib. Try using OpenBLAS and see if that helps.
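To check which BLAS/LAPACK build NumPy is linked against before and after switching, something like the following may help. This is a minimal sketch: `blas_report` is a hypothetical helper name, and the only library call it relies on is `numpy.show_config()`, which prints NumPy's build configuration.

```python
import importlib.util


def blas_report():
    """Print NumPy's build configuration; return the NumPy version, or None."""
    # Hypothetical helper: do nothing if NumPy is not installed.
    if importlib.util.find_spec("numpy") is None:
        return None
    import numpy as np

    # The printed sections name the BLAS/LAPACK libraries NumPy was built
    # against; look for "mkl" or "openblas" in the output.
    np.show_config()
    return np.__version__


blas_report()
```

Running this in the same environment that hangs shows whether the MKL-linked build is still the one being imported.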

Recently my processes have started freezing when I run models with large-ish data, and I’m not sure why. Unfortunately, I can’t trace this back to a specific change that I might have made.

Basically, this works fine:

from sklearn.cross_validation import cross_val_score
from numpy.random import randn
from sklearn.linear_model import Ridge

X = randn(815000, 100)
y = randn(815000, 1)
mod = Ridge()
sc = cross_val_score(mod, X, y, n_jobs=3)

while running the following code:

from sklearn.cross_validation import cross_val_score
from numpy.random import randn
from sklearn.linear_model import Ridge

X = randn(815000, 300)
y = randn(815000, 1)
mod = Ridge()
sc = cross_val_score(mod, X, y, n_jobs=3)

results in the forked processes hanging.

What happens is that first a number of processes spawn off, and they churn away at the data for a while. This is top after a few seconds of running the above code:

[image: screenshot of top output]

However, after another 10 seconds or so, these processes finish, and another set of processes is created that hangs:

[image: screenshot of top output]

You can see the processes that spawned off, and that none of them are chewing up any CPU time. It remains in this state indefinitely…
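Because the hang never resolves on its own, it can be convenient to run the failing script under a timeout while testing different library versions. The sketch below uses only the standard library; `finishes_within` is a hypothetical helper name, and the two snippets passed to it are stand-ins for the real script.

```python
import subprocess
import sys


def finishes_within(code, seconds):
    """Run `code` in a fresh interpreter; True if it exits before the timeout."""
    try:
        subprocess.run([sys.executable, "-c", code], timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout. Worker processes
        # that the child forked may survive and need manual cleanup.
        return False


# Stand-in snippets: one that returns quickly and one that outlives its timeout.
print(finishes_within("print('done')", 30))
print(finishes_within("import time; time.sleep(10)", 1))
```

To try it on the failing case, pass the source of the 300-column script as `code` with a timeout well beyond the time the 100-column version needs.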

I thought this might be a problem with joblib trying to memmap things, but both matrices are well over the max_nbytes default for Parallel (at least, according to X.nbytes).
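For reference, the sizes involved can be worked out without allocating the arrays. The sketch below assumes 8-byte float64 elements, which is what `randn` returns; `array_nbytes` is a hypothetical helper that mirrors what `X.nbytes` reports.

```python
FLOAT64_BYTES = 8  # randn returns float64 arrays


def array_nbytes(rows, cols, itemsize=FLOAT64_BYTES):
    """Size in bytes of a dense rows x cols array (same value as ndarray.nbytes)."""
    return rows * cols * itemsize


working = array_nbytes(815000, 100)  # X in the example that works
hanging = array_nbytes(815000, 300)  # X in the example that hangs
target = array_nbytes(815000, 1)     # y in both examples

print(working)  # 652000000
print(hanging)  # 1956000000
print(target)   # 6520000
```

These figures can be compared against whatever `max_nbytes` value the installed joblib version uses.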

Note that these matrices, and ones larger than them, have worked totally fine in the past for fitting these kinds of models. I’m not really sure what’s going on…

I’m using:

  • sklearn version: 0.14.1
  • joblib version: 0.8.0a3 (though this also breaks on 7.1)
  • all packages linked against MKL (though I tried it after removing MKL in anaconda, and it still hangs)
  • Unix machine (CentOS)

Note: I wasn’t sure whether this was a SKL problem or a joblib problem, so let me know if I should open this on the SKL repo instead. I just used cross_val_score because it was the simplest way to describe what happens, but this problem exists when I’m using my own parallelization code w/ joblib as well.

Issue Analytics

  • State: open
  • Created 9 years ago
  • Comments: 37 (14 by maintainers)

Top GitHub Comments

1 reaction
GaelVaroquaux commented, Feb 6, 2017

Thanks for the info.

1 reaction
arthurmensch commented, Jul 7, 2015

+1 I am still encountering this issue with joblib 0.8.4 with anaconda and MKL
