
UMAP hangs when training in a multiprocessing.Process

See original GitHub issue

Hey Leland,

Thanks for the great library.

I’ve got a strange error: it looks like UMAP training completely hangs when it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:

import multiprocessing
import sys
import time

import numpy as np
import umap


def train_model(q=None):
    # Fit UMAP on random data; push the result to q if a queue is given,
    # otherwise return it directly.
    embeddings = np.random.rand(100, 512)
    reducer = umap.UMAP()
    print("Got reducer, about to start training")
    sys.stdout.flush()
    if q is None:
        return reducer.fit_transform(embeddings)
    print("outputting to q")
    q.put(reducer.fit_transform(embeddings))
    print("output to q")


# Training in the main process works fine.
start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

# Training in a child process hangs before the result reaches the queue.
start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)

This results in the following output:

(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took:  7.140857934951782
got:  [[ 5.585276  10.613853 ]
 [ 3.6862304  8.075892 ]
 [ 4.7457848  8.287621 ]
 [ 3.1373663  9.443794 ]
 [ 3.3923576  8.651798 ]
 [ 5.8636594 10.131909 ]
 [ 3.6680114 11.535476 ]
 [ 1.924135   9.987121 ]
 [ 4.9095764  8.643579 ]
 ...
 [ 4.6614685  9.943193 ]
 [ 3.5867712 10.872507 ]
 [ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q

after which I have to Ctrl-C because nothing happens.

Any ideas what is going on?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
theahura commented, Jun 25, 2021

Like many umap/numba problems, this was fixed by switching to a different numba threading backend. I was previously using the ‘workqueue’ layer, which would just hang. Switching to ‘omp’ surfaced an actual error:

Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.

Switching to tbb seemed to work with the minimal example above, though I had a fair bit of trouble getting tbb to actually load (see: https://github.com/numba/numba/issues/7148)

I’ll close this out, but this was definitely a weird interaction between numba and some other multiprocessing stuff. Seems brittle, but not really sure what’s to be done about it 🤔
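
For reference, here is a minimal sketch (not from the thread itself) of pinning numba to the tbb threading layer before UMAP compiles anything. It assumes the tbb package is installed (e.g. pip install tbb); as the linked numba issue shows, getting the layer to actually load can take some fiddling.

import os

# Option 1: environment variable, set before numba is imported anywhere.
os.environ["NUMBA_THREADING_LAYER"] = "tbb"

import numba
import numpy as np
import umap

# Option 2: numba's config flag, set before the first parallel function runs.
numba.config.THREADING_LAYER = "tbb"

reducer = umap.UMAP()
embedding = reducer.fit_transform(np.random.rand(100, 512))
print("threading layer used:", numba.threading_layer())

The GNU OpenMP message also points at the other common way around this kind of hang: starting the worker with multiprocessing’s "spawn" start method (multiprocessing.get_context("spawn")), so the child does not inherit the parent’s already-initialized OpenMP state.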

0 reactions
lmcinnes commented, Jun 25, 2021

Glad you found a solution, but it definitely seems brittle. In general the tbb backend seems to fix most of these problems, but sadly it is often not the default for users.
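
Because the layer numba selects depends on what it finds in the environment, a quick way to see what a given machine actually ended up with is to run any parallel-compiled function and then ask numba; a small sketch, independent of UMAP:

import numpy as np
from numba import njit, prange, threading_layer

@njit(parallel=True)
def parallel_sum(x):
    # Any parallel kernel will do; it just has to execute once so that
    # numba selects and initializes a threading layer.
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i]
    return total

parallel_sum(np.arange(1000, dtype=np.float64))
print(threading_layer())  # e.g. 'tbb', 'omp', or 'workqueue'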


