
NaN with large (>1M rows) embeddings

See original GitHub issue

I’ve been trying prime-factor-space embeddings of larger numbers of integers. However, at around 5M points, UMAP starts producing an embedding that is entirely NaN.

Maybe I’m pushing my luck with the dataset size here, but it seems it should work given enough RAM 😃

Setup

  • Python: 3.6.4 (Linux/x64 on AWS EC2 r4.4xlarge, 122 GB RAM)
  • UMAP: 0.3.2
  • Metric: “cosine”, init: “random”
  • Input: (16_777_214, 1_077_871) binary matrix, 51_096_439 non-zero entries, scipy.sparse.csr format, dtype float64

Attempts to debug

  • I initially thought spectral initialisation was causing the problem, but init “random” shows the same behaviour.
  • All values in the input array are finite, non-NaN
  • I tried running the numba cosine metric exactly as implemented in UMAP on random pairs of vectors for several million iterations, but never got NaN or inf, as expected.
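A plain-NumPy version of that sanity check might look like the sketch below (an illustration only; the actual test above used the numba-jitted cosine implementation from UMAP itself, not this reimplementation):

```python
import numpy as np

def cosine_dist(x, y):
    # 1 - <x, y> / (||x|| * ||y||), the usual cosine distance
    norm = np.linalg.norm(x) * np.linalg.norm(y)
    if norm == 0.0:
        return 1.0  # convention when either vector is all zeros
    return 1.0 - np.dot(x, y) / norm

# Fuzz the metric with random sparse binary rows and check finiteness
rng = np.random.default_rng(0)
for _ in range(10_000):
    x = (rng.random(128) < 0.02).astype(np.float64)
    y = (rng.random(128) < 0.02).astype(np.float64)
    assert np.isfinite(cosine_dist(x, y))
```

Note the explicit zero-norm guard: sparse binary data routinely contains all-zero rows, which is exactly the kind of input that can turn a naive cosine implementation into a 0/0.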

Example code

For 2^24 points; the problem also occurs at 5M points, while the first 1M rows embed correctly:

     import numpy as np
     import scipy.sparse
     import umap

     X = scipy.sparse.load_npz("factorized_16777216.npz")
     embedding = umap.UMAP(metric='cosine', init='random', n_epochs=500, verbose=2).fit_transform(X)
     np.save('embedded_16777216_pts.npy', embedding.astype(np.float32))

The data file factorized_16777216.npz is here: https://drive.google.com/open?id=1SnpvkoqfX4-u-BS8KfWealz0VFHZ5goc [100MB]

The same problem can be reproduced by taking the first 5M rows, which fits into roughly 40 GB of RAM.
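Slicing the leading rows off a CSR matrix is cheap and keeps the sparse format. A small sketch with a synthetic stand-in (for the real data, `X` would come from `scipy.sparse.load_npz("factorized_16777216.npz")` instead):

```python
import numpy as np
import scipy.sparse

# Synthetic stand-in for the real 16M-row matrix
X = scipy.sparse.random(1_000, 50, density=0.01, format="csr", dtype=np.float64)

X_head = X[:100]  # row slicing on CSR is cheap and stays sparse
```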

Is there any way to debug where/when this is happening? I suppose I could use np.seterr() to trap NaNs, but I’m not sure whether that will help with the numba-accelerated parts.
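For the pure-NumPy parts, np.seterr() can indeed turn silent NaN production into an exception; numba-compiled kernels generally do not honour these settings, though, so NaNs born inside UMAP’s jitted loops would likely slip through. A minimal sketch of the trap:

```python
import numpy as np

old = np.seterr(invalid="raise")   # NaN-producing ops now raise FloatingPointError
raised = False
try:
    np.zeros(1) / np.zeros(1)      # 0/0 silently yields NaN under the defaults
except FloatingPointError:
    raised = True
finally:
    np.seterr(**old)               # restore the previous error handling
```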

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

2 reactions
johnhw commented, Sep 9, 2018

Fantastic, I can confirm the solution fixes the problem, at least for my use case! Here’s a picture of 8M integers, hot off the press.

I’m very impressed UMAP scales to this size of dataset.

0 reactions
sleighsoft commented, Sep 16, 2019

Seems resolved. Closing.
