Running UMAP (?) on 10m-100m sized datasets?
Hi again, everyone!
I would like to ask for some advice. The question is probably UMAP-related, but possibly more abstract.
I have had some success using this combination of tools in my work:
- Get a lot of data (~2m rows * 300 columns (word embeddings));
- Use PCA to reduce dimensions to 30-50;
- Then use UMAP to reduce to 10;
- Then use HDBSCAN for clustering;
- Then use this to enrich the data and some naive affinity propagation / trees for final post-processing to make the data usable for end-users;
All of this is fine and dandy, but in my experiments I start having trouble with the above scheme once datasets reach around 10m points.
Obviously I can try different tricks (feeding HDBSCAN from PCA directly, reducing dimensions more aggressively, e.g. 300=>20=>10, sub-sampling the data and then just applying .transform, buying more 🐏, etc).
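The sub-sample-then-.transform trick mentioned above could look something like this. It is a sketch under assumptions: `fit_on_sample_transform_in_chunks` is a hypothetical helper, and PCA is used here only as a lightweight stand-in so the example runs quickly. Any reducer exposing `fit` and `transform` (which `umap.UMAP` does) could be passed in its place.

```python
import numpy as np
from sklearn.decomposition import PCA


def fit_on_sample_transform_in_chunks(reducer, X, sample_size, chunk_size=100_000, seed=0):
    """Hypothetical helper: fit on a random subsample, then embed all rows chunk by chunk."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Fit only on a random subset so the expensive step sees fewer rows.
    idx = rng.choice(n, size=min(sample_size, n), replace=False)
    reducer.fit(X[idx])
    # Transform the full dataset in fixed-size chunks to bound peak memory.
    parts = [reducer.transform(X[start:start + chunk_size])
             for start in range(0, n, chunk_size)]
    return np.vstack(parts)


# Stand-in demo: PCA instead of UMAP, small random data.
X_demo = np.random.default_rng(1).normal(size=(1000, 20))
Z_demo = fit_on_sample_transform_in_chunks(
    PCA(n_components=5), X_demo, sample_size=200, chunk_size=300
)
```

The obvious trade-off is that points unlike anything in the subsample may be embedded poorly, so the sample needs to cover the data reasonably well.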
But maybe someone knows of approaches for producing high-quality embeddings in a parallelized / mini-batch fashion? If so, could you point me in the right direction? I know that people train matrix factorization machines on the GPU; maybe there is something similar?
Or maybe I am just missing something with my setup?
I understand that in the long run, the correct approach is:
- Get annotation;
- Find the best CNN and train a classifier;
- Afaik, it may even work with up to ~5,000 classes;
Many thanks!
Top GitHub Comments
@lmcinnes
So far, I managed to do the following
It turns out that graph.data contains only ones and zeroes, and epochs_per_sample therefore explodes and the max value is inf. Does this mean that I have to pre-process the KNN graph somehow before feeding it like this, or am I making some error?
As for tests:
It is also ones and zeros. What is interesting: I compared your KNN graph and the faiss KNN graph, and they seem almost identical. I guess the devil lies somewhere in the data types (I believe I save my KNN graph indices as floats). I will double-check and report back.
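The inf in epochs_per_sample described above can be reproduced with plain NumPy. This sketch assumes the sampling schedule is inversely proportional to each edge weight normalised by the maximum weight. `naive_epochs_per_sample` is a hypothetical function written for illustration, not UMAP's actual code, which may guard against zeros differently.

```python
import numpy as np


def naive_epochs_per_sample(weights, n_epochs):
    """Hypothetical schedule: an edge with weight w is sampled every max(w) / w epochs."""
    n_samples = n_epochs * (weights / weights.max())
    # Explicitly stored zero weights make n_samples zero, so the division yields inf.
    with np.errstate(divide="ignore"):
        return n_epochs / n_samples


# Edge weights as described in the comment: only ones and zeros.
weights = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
naive = naive_epochs_per_sample(weights, n_epochs=200)

# Dropping the explicitly stored zeros before computing the schedule avoids the inf.
kept = weights > 0
filtered = naive_epochs_per_sample(weights[kept], n_epochs=200)
```

With a purely binary graph every surviving edge gets the same schedule, so all neighbours are weighted equally regardless of distance.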