UMAP hangs when training in a multiprocessing.Process
Hey Leland,
Thanks for the great library.
I’ve run into a strange problem: it looks like UMAP training completely hangs if it is run inside a multiprocessing.Process. Minimal example on Python 3.8.5:
import umap
import multiprocessing
import numpy as np
import sys
import time

def train_model(q=None):
    embeddings = np.random.rand(100, 512)
    reducer = umap.UMAP()
    print("Got reducer, about to start training")
    sys.stdout.flush()
    if not q:
        return reducer.fit_transform(embeddings)
    print("outputting to q")
    q.put(reducer.fit_transform(embeddings))
    print("output to q")

start = time.time()
model_output = train_model()
print('normal took: ', time.time() - start)
print('got: ', model_output)

start = time.time()
q = multiprocessing.Queue()
p = multiprocessing.Process(target=train_model, args=(q,), daemon=True)
p.start()
model_output = q.get()
print('multi took: ', time.time() - start)
print('got: ', model_output)
This results in the following output:
(env) amol@amol-small:~/code/soot/api-server/src$ python umap_multiprocessing_test.py
2021-06-24 16:09:46.233186: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-24 16:09:46.233212: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Got reducer, about to start training
normal took: 7.140857934951782
got: [[ 5.585276 10.613853 ]
[ 3.6862304 8.075892 ]
[ 4.7457848 8.287621 ]
[ 3.1373663 9.443794 ]
[ 3.3923576 8.651798 ]
[ 5.8636594 10.131909 ]
[ 3.6680114 11.535476 ]
[ 1.924135 9.987121 ]
[ 4.9095764 8.643579 ]
...
[ 4.6614685 9.943193 ]
[ 3.5867712 10.872507 ]
[ 4.8476524 10.628259 ]]
Got reducer, about to start training
outputting to q
after which I have to Ctrl-C because nothing happens.
Any ideas what is going on?
Top GitHub Comments
Like many umap/numba problems, this was fixed by switching to a different numba threading backend. I was previously using workqueues, which would just hang. Switching to 'omp' surfaced an actual error:
Terminating: fork() called from a process already using GNU OpenMP, this is unsafe.
Switching to tbb seemed to work with the minimal example above, though I had a fair bit of trouble getting tbb to actually load (see: https://github.com/numba/numba/issues/7148)
I’ll close this out, but this was definitely a weird interaction between numba and some other multiprocessing stuff. Seems brittle, but not really sure what’s to be done about it 🤔
Glad you found a solution, but it definitely seems brittle. In general the tbb backend seems to fix most problems, but sadly it is often not the default for users.
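
For reference, a minimal sketch of how the numba threading layer can be pinned before umap is imported, using numba's documented NUMBA_THREADING_LAYER environment variable (or, equivalently, numba's config module). The tbb backend requires the separate tbb package to be installed, and the rest of the repro script above is assumed to stay unchanged:

import os

# Select numba's threading layer before anything triggers numba compilation.
# Valid values include 'tbb', 'omp', and 'workqueue'; 'tbb' needs the tbb
# package installed alongside numba.
os.environ["NUMBA_THREADING_LAYER"] = "tbb"

import umap  # imported only after the environment variable is set

# The same selection can also be made through numba's config module:
# from numba import config
# config.THREADING_LAYER = "tbb"

Whether this alone avoids the hang may still depend on how the child process is started, since the fork() error above points at a bad interaction between fork and GNU OpenMP, so it is worth testing against the actual workload.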