Feature request: Parallel/Multicore implementation of t-SNE
Has there been any discussion on implementing a multicore version of t-SNE in sklearn?
The fastest version that I have seen/used is https://github.com/claczny/VizBin/tree/master/src/backend/bh_tsne .
I think one simple addition would be to change line 160 of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/manifold/t_sne.py from `pdist(X_embedded, "sqeuclidean")` to the equivalent function in `sklearn.metrics.pairwise`, in order to use its `n_jobs` parameter.
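As a rough sketch of what that swap could look like (the array name `X_embedded` is taken from `t_sne.py`; the random data here is just a stand-in), `pairwise_distances` already accepts the `"sqeuclidean"` metric and an `n_jobs` argument:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import pairwise_distances

# Stand-in for the low-dimensional embedding used inside t_sne.py.
X_embedded = np.random.RandomState(0).randn(100, 2)

# Current approach: scipy's single-threaded condensed distance matrix.
d_scipy = squareform(pdist(X_embedded, "sqeuclidean"))

# Possible replacement: pairwise_distances exposes n_jobs for parallelism.
d_sklearn = pairwise_distances(X_embedded, metric="sqeuclidean", n_jobs=-1)

# Both should produce the same squared Euclidean distance matrix.
assert np.allclose(d_scipy, d_sklearn)
```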
Regarding the actual algorithm, https://github.com/DmitryUlyanov/Multicore-TSNE has implemented this, but in a different programming language, so I cannot easily see what has been done. I believe that method only works for 2D embeddings (which could be fine if noted) and is very fast.
I also had some very basic (and potentially naive) ideas about using Bayesian optimization to speed up the algorithm: https://www.reddit.com/r/MachineLearning/comments/78i9rh/discussion_bayesian_optimization_of_tsne/. I would appreciate any insight on that, though this may not be the right place for it.
Just trying to think of ways to use this on larger datasets.
Issue Analytics
- State:
- Created 6 years ago
- Comments: 14 (11 by maintainers)
Top GitHub Comments
I believe parallelising the neighbour computation (whether via ball trees or a full distance computation) would be relatively straightforward. There are, of course, seriously diminishing returns on that beyond 4 or 8 cores. Parallelising the gradient descent is rather harder, and I think there are even fewer gains to be had there. I would be willing to take a look at this at some point if people are still interested.
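To illustrate the neighbour-computation part of this: scikit-learn's `NearestNeighbors` estimator already accepts an `n_jobs` parameter, so the k-NN step that Barnes-Hut t-SNE relies on can be parallelised independently of the gradient descent. A minimal sketch (the data and neighbour count here are arbitrary stand-ins):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in for the high-dimensional t-SNE input.
X = np.random.RandomState(0).randn(500, 10)

# Ball-tree neighbour search spread across all available cores.
nn = NearestNeighbors(n_neighbors=30, algorithm="ball_tree", n_jobs=-1)
nn.fit(X)
distances, indices = nn.kneighbors(X)

print(indices.shape)  # (500, 30)
```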
The neighbors computation has been multithreaded in #15082 and the gradient computation has been parallelized in #13264. Is this issue still relevant? Thanks.