NaN with large (>1M rows) embeddings
I’ve been trying prime-factor-space embeddings of increasingly large sets of integers. However, at around 5M points, UMAP starts returning an embedding that is entirely NaN.
Maybe I’m pushing my luck with the dataset size here, but it seems it should work given enough RAM 😃
Setup
- Python: 3.6.4 (Linux/x64 on AWS EC2 r4.4xlarge, 122GB RAM)
- UMAP: 0.3.2
- Metric: “cosine”, init “random”
- Input: (16_777_214, 1_077_871) binary matrix, 51_096_439 non-zero entries, scipy.sparse.csr format, dtype float64
Attempts to debug
- I initially suspected the spectral initialisation, but the problem persists with init “random”.
- All values in the input array are finite and non-NaN.
- I tried running the numba cosine metric exactly as implemented in UMAP on random pairs of vectors for several million iterations, but never got NaN or inf, as expected.
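The input checks listed above can be sketched roughly as follows. This is only an illustration, not the code that was actually run: the helper name summarize_sparse_input is hypothetical, and the count of rows with no stored entries is an extra check added here on the assumption that cosine distance is not well defined for an all-zero vector.

```python
import numpy as np
import scipy.sparse as sp


def summarize_sparse_input(X):
    """Return basic sanity statistics for a sparse matrix (hypothetical helper)."""
    X = sp.csr_matrix(X)
    # Number of stored entries per row; explicit zeros would count as stored.
    row_nnz = np.diff(X.indptr)
    return {
        "all_finite": bool(np.isfinite(X.data).all()),
        "n_empty_rows": int((row_nnz == 0).sum()),
        "n_rows": int(X.shape[0]),
    }


if __name__ == "__main__":
    # Tiny stand-in for the real matrix: the middle row has no entries at all.
    X_demo = sp.csr_matrix(np.array([[1.0, 0.0, 1.0],
                                     [0.0, 0.0, 0.0],
                                     [0.0, 1.0, 0.0]]))
    print(summarize_sparse_input(X_demo))
```

Only the stored values are inspected, so the check stays cheap even for a matrix with millions of rows.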
Example code
This is for 2^24 points, but the problem also occurs with as few as 5M. The first 1M rows embed correctly:
import numpy as np
import scipy.sparse
import umap
X = scipy.sparse.load_npz("factorized_16777216.npz")
embedding = umap.UMAP(metric='cosine', init='random', n_epochs=500, verbose=2).fit_transform(X)
np.save('embedded_16777216_pts.npy', embedding.astype(np.float32))
The data file factorized_16777216.npz is here: https://drive.google.com/open?id=1SnpvkoqfX4-u-BS8KfWealz0VFHZ5goc [100MB]
The same problem can be reproduced with only the first 5M rows, which fits in about 40GB of RAM.
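A minimal sketch of cutting the reproduction down to the first rows is shown here. The helper names first_rows and csr_nbytes are hypothetical, and csr_nbytes measures only the CSR arrays of the input, not the working memory UMAP needs during fitting.

```python
import numpy as np
import scipy.sparse as sp


def first_rows(X, n_rows):
    """Return the first n_rows rows of X as a CSR matrix (hypothetical helper)."""
    return sp.csr_matrix(X)[:n_rows]


def csr_nbytes(X):
    """Bytes held by the three CSR arrays of X (input footprint only)."""
    return X.data.nbytes + X.indices.nbytes + X.indptr.nbytes


if __name__ == "__main__":
    # Small stand-in; with the real file this would be first_rows(X, 5_000_000).
    X_demo = sp.csr_matrix(np.eye(6))
    X_small = first_rows(X_demo, 2)
    print(X_small.shape, X_small.nnz, csr_nbytes(X_small))
```

Row slicing a CSR matrix keeps it sparse, so the subset can be passed to fit_transform in the same way as the full matrix.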
Is there any way to debug where/when this is happening? I suppose I can use np.seterr() to trap NaNs, but I’m not sure whether that will help in the numba-accelerated parts.
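One approach that does not depend on how numba handles floating-point errors is to inspect the returned embedding after the fact. The sketch below assumes a 2-D array as returned by fit_transform; the helper name nan_row_report is hypothetical. It shows which rows went bad, not where inside UMAP the NaNs first appeared.

```python
import numpy as np


def nan_row_report(embedding):
    """Summarize which rows of a 2-D embedding contain NaN or inf (hypothetical helper)."""
    embedding = np.asarray(embedding)
    # A row is bad if any of its coordinates is NaN or infinite.
    bad = ~np.isfinite(embedding).all(axis=1)
    idx = np.flatnonzero(bad)
    return {
        "n_bad": int(idx.size),
        "first_bad": int(idx[0]) if idx.size else None,
        "fraction_bad": float(bad.mean()),
    }


if __name__ == "__main__":
    demo = np.array([[0.0, 1.0], [np.nan, 2.0], [3.0, np.inf]])
    print(nan_row_report(demo))
```

Comparing the report across subsets of increasing size (1M, 2M, 5M rows) may help narrow down whether every row fails at once or whether the bad rows begin at a particular index.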
Issue Analytics
- Created 5 years ago
- Comments: 6 (3 by maintainers)
Top GitHub Comments
Fantastic, I can confirm the solution fixes the problem, at least for my use case! Here’s a picture of 8M integers, hot off the press
I’m very impressed UMAP scales to this size of dataset.
Seems resolved. Closing.