
Running UMAP (?) on 10m-100m sized datasets?

See original GitHub issue

Hi again, everyone! I'd like to ask for some advice, probably UMAP-related, possibly more abstract.

I have had some success using this combination of tools in my work:

  • Get a lot of data (~2m rows * 300 columns of word embeddings);
  • Use PCA to reduce the dimensionality to 30-50;
  • Then use UMAP to reduce to 10;
  • Then use HDBSCAN for clustering;
  • Then use this to enrich the data, plus some naive affinity propagation / trees for final post-processing, to make the data usable for end users (a minimal sketch of this pipeline follows the list);
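For concreteness, here is a minimal sketch of that pipeline; the data is a random stand-in, and all sizes and hyperparameters are illustrative assumptions rather than the exact settings described above.

# Minimal sketch of the PCA -> UMAP -> HDBSCAN pipeline (sizes are illustrative).
import numpy as np
import umap
import hdbscan
from sklearn.decomposition import PCA

X = np.random.rand(10_000, 300).astype(np.float32)  # stand-in for ~2m word embeddings

X_pca = PCA(n_components=50).fit_transform(X)               # 300 -> 50 dims
X_umap = umap.UMAP(n_components=10).fit_transform(X_pca)    # 50 -> 10 dims
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(X_umap)  # -1 marks noise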

All of this is fine and dandy, but my experiments suggest that with datasets of around 10m points I start having trouble with the above scheme.

Obviously I can try different tricks (feeding HDBSCAN from the PCA output directly, reducing dimensions more aggressively, e.g. 300 => 20 => 10, sub-sampling the data and then just applying .transform, buying more 🐏, etc.).
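As a sketch of the sub-sampling trick just mentioned (sizes and names are assumptions): fit UMAP on a random subset, then push the full dataset through .transform.

# Fit UMAP on a subsample, then embed everything via .transform (illustrative sizes).
import numpy as np
import umap

X_pca = np.random.rand(100_000, 50).astype(np.float32)  # stand-in for the PCA output
idx = np.random.choice(len(X_pca), size=10_000, replace=False)

reducer = umap.UMAP(n_components=10).fit(X_pca[idx])  # fit on the subsample only
X_umap = reducer.transform(X_pca)                     # project the full dataset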

But maybe someone knows whether there are approaches for producing high-quality embeddings in a parallelized / mini-batch fashion? If so, can you point me in the right direction? I know that people train matrix factorization machines on the GPU; maybe there is something similar here?

Or maybe I am just missing something with my setup?

I understand that, in the long run, the correct approach is:

  • Get annotations;
  • Find the best CNN and train a classifier;
  • AFAIK, it may even work with up to ~5,000 classes.

Many thanks!

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
snakers4 commented, Aug 26, 2018

@lmcinnes

So far, I have managed to do the following:

  • Build a KNN graph using faiss - here is a gist that I assembled; maybe someone will find it useful (a rough sketch of this step follows the list);
  • Feed this KNN graph to fuzzy_simplicial_set to get the graph objects (the most interesting part, in terms of porting it into PyTorch, is obviously how you do the weighting);
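The gist itself is not reproduced here, so the following is only a rough sketch of the faiss step, with an assumed index type, assumed sizes, and stand-in data:

# Rough sketch of building a KNN graph with faiss (parameters are assumptions).
import faiss
import numpy as np

bc_vectors_sample = np.random.rand(1_000_000, 50).astype(np.float32)  # stand-in data

index = faiss.IndexFlatL2(bc_vectors_sample.shape[1])  # exact L2 search
index.add(bc_vectors_sample)
knn_dists, knn_indices = index.search(bc_vectors_sample, 100)  # (n_samples, 100)
knn_dists = np.sqrt(knn_dists)  # IndexFlatL2 returns *squared* L2 distances
# Note: knn_indices comes back as int64; keep it integral when saving/loading.

With knn_indices and knn_dists in hand, the graph construction below consumes them directly: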
# knn_indices is a KNN graph from faiss (n_samples, n_neighbors)
# knn_dists is the matching KNN distances from faiss (n_samples, n_neighbors)
import numpy as np
from umap.umap_ import fuzzy_simplicial_set, make_epochs_per_sample

# Build the fuzzy simplicial set (the weighted graph UMAP optimizes) from the
# precomputed KNN graph; X is only used as a fallback when knn_* are not given.
# Note: newer umap versions (>= 0.5) return a (graph, sigmas, rhos) tuple here.
graph = fuzzy_simplicial_set(
    X=bc_vectors_sample,
    n_neighbors=100,
    random_state=np.random.RandomState(seed=42),
    metric='euclidean',
    metric_kwds={},
    knn_indices=knn_indices,
    knn_dists=knn_dists,
    angular=False,
    set_op_mix_ratio=1.0,
    local_connectivity=1.0,
    verbose=True,
)

graph = graph.tocoo()
graph.sum_duplicates()
n_vertices = graph.shape[1]

n_epochs = 200

# Drop edges too weak to be sampled even once in n_epochs, then compute how
# often each remaining edge should be sampled during optimization.
graph.data[graph.data < (graph.data.max() / float(n_epochs))] = 0.0
graph.eliminate_zeros()
epochs_per_sample = make_epochs_per_sample(graph.data, n_epochs)

It turns out that graph.data contains only ones and zeroes, so epochs_per_sample explodes and its max value is inf. Does this mean I have to pre-process the KNN graph somehow before feeding it in like this, or am I making some error?
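A quick way to check this, for example, is to look at the weight distribution and the dtypes just before make_epochs_per_sample:

# Diagnostic: healthy fuzzy-graph weights are spread over (0, 1], not just {0, 1};
# float-typed indices from a saved KNN graph would also corrupt the lookups.
import numpy as np

print(np.unique(graph.data)[:10], graph.data.min(), graph.data.max())
print(knn_indices.dtype, knn_dists.dtype)  # expect integer indices, float distances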

As for tests:

  • Creating a spectral embedding seems feasible only for datasets below 1-2m points;
  • For larger datasets, memory starts to become an issue (I doubt that library users will realistically have more than 128 GB of RAM);
  • As for faiss, my results were the following:
    • A plain (exact) index does not fit on one conventional GPU;
    • An approximate index that is supported on the GPU takes ~500 MB of GPU RAM (17% of RAM is reserved), so I guess it can be scaled much, much further;
    • A KNN graph for ~7-10m points is calculated in ~2 hours tops;
    • I guess it can easily be scaled up to 100m points this way (a sketch of such an index follows this list);
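For reference, a sketch of the kind of approximate GPU index meant above; the IVF cell count, nprobe, and stand-in data are all assumptions:

# Approximate GPU KNN with a faiss IVF index (all sizes here are illustrative).
import faiss
import numpy as np

data = np.random.rand(1_000_000, 50).astype(np.float32)
d = data.shape[1]

quantizer = faiss.IndexFlatL2(d)                        # coarse quantizer
cpu_index = faiss.IndexIVFFlat(quantizer, d, 4096)      # 4096 inverted lists
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)   # move the index to GPU 0

gpu_index.train(data)        # IVF indexes must be trained before add()
gpu_index.add(data)
gpu_index.nprobe = 32        # lists probed per query: accuracy vs. speed trade-off
knn_dists, knn_indices = gpu_index.search(data, 100)    # squared L2 distances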
0 reactions
snakers4 commented, Aug 28, 2018

coming out of fuzzy_simplicial_set

It is also ones and zeros. What is interesting: I compared your KNN graph and the faiss KNN graph, and they seem almost identical. I guess the devil lies somewhere in the data types (I believe I save my KNN graph indices as floats). I will double-check and report back.
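If the float-saved indices turn out to be the culprit, the suspected fix is a simple cast back before building the graph (a sketch, assuming the variables from the snippet above):

# Suspected fix: restore the dtypes faiss originally produced.
import numpy as np

knn_indices = knn_indices.astype(np.int64)   # indices must be integral
knn_dists = knn_dists.astype(np.float32)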
