
UMAP hangs when training in a multiprocessing.Process

See original GitHub issue

Hey Leland,

Thanks for the great library.

I’ve got a strange error: it looks like UMAP training completely hangs when it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:

import multiprocessing
import sys
import time

import numpy as np
import umap


def train_model(q=None):
    # Fit UMAP on random data; push the result to q if a queue is given,
    # otherwise return it directly.
    embeddings = np.random.rand(100, 512)
    reducer = umap.UMAP()
    print("Got reducer, about to start training")
    sys.stdout.flush()
    if q is None:
        return reducer.fit_transform(embeddings)
    print("outputting to q")
    q.put(reducer.fit_transform(embeddings))
    print("output to q")


# Training in the main process works fine.
start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

# Training in a child process hangs before the result reaches the queue.
start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)

This results in the following output:

(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took:  7.140857934951782
got:  [[ 5.585276  10.613853 ]
 [ 3.6862304  8.075892 ]
 [ 4.7457848  8.287621 ]
 [ 3.1373663  9.443794 ]
 [ 3.3923576  8.651798 ]
 [ 5.8636594 10.131909 ]
 [ 3.6680114 11.535476 ]
 [ 1.924135   9.987121 ]
 [ 4.9095764  8.643579 ]
 ...
 [ 4.6614685  9.943193 ]
 [ 3.5867712 10.872507 ]
 [ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q

after which I have to Ctrl-C because nothing happens.

Any ideas what is going on?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

2 reactions
theahura commented, Jun 25, 2021

Like many umap/numba problems, this was fixed by switching to a different numba threading backend. I was previously using the ‘workqueue’ layer, which would just hang. Switching to ‘omp’ surfaced an actual error:

Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.

Switching to tbb seemed to work with the minimal example above, though I had a fair bit of trouble getting tbb to actually load (see: https://github.com/numba/numba/issues/7148)

I’ll close this out, but this was definitely a weird interaction between numba and some other multiprocessing stuff. Seems brittle, but not really sure what’s to be done about it 🤔
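
For reference, here is a minimal sketch (not from the thread itself) of pinning numba to the tbb threading layer before UMAP compiles anything. It assumes the tbb package is installed (e.g. pip install tbb); as the linked numba issue shows, getting the layer to actually load can take some fiddling.

import os

# Option 1: environment variable, set before numba is imported anywhere.
os.environ["NUMBA_THREADING_LAYER"] = "tbb"

import numba
import numpy as np
import umap

# Option 2: numba's config flag, set before the first parallel function runs.
numba.config.THREADING_LAYER = "tbb"

reducer = umap.UMAP()
embedding = reducer.fit_transform(np.random.rand(100, 512))
print("threading layer used:", numba.threading_layer())

The GNU OpenMP message also points at the other common way around this kind of hang: starting the worker with multiprocessing’s "spawn" start method (multiprocessing.get_context("spawn")), so the child does not inherit the parent’s already-initialized OpenMP state.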

0 reactions
lmcinnes commented, Jun 25, 2021

Glad you found a solution, but it definitely seems brittle. In general the tbb backend seems to fix most of these problems, but sadly it is often not the default for users.
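
Because the layer numba selects depends on what it finds in the environment, a quick way to see what a given machine actually ended up with is to run any parallel-compiled function and then ask numba; a small sketch, independent of UMAP:

import numpy as np
from numba import njit, prange, threading_layer

@njit(parallel=True)
def parallel_sum(x):
    # Any parallel kernel will do; it just has to execute once so that
    # numba selects and initializes a threading layer.
    total = 0.0
    for i in prange(x.shape[0]):
        total += x[i]
    return total

parallel_sum(np.arange(1000, dtype=np.float64))
print(threading_layer())  # e.g. 'tbb', 'omp', or 'workqueue'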


