NaN with large (>1M rows) embeddings

I’ve been trying prime-factor-space embeddings of increasingly large sets of integers. However, at around 5M points, UMAP starts returning an embedding in which every value is NaN.

Maybe I’m pushing my luck with the dataset size here, but it seems it should work given enough RAM 😃

Setup

Python: 3.6.4 (Linux/x64 on AWS EC2 r4.4xlarge, 122GB RAM)
UMAP: 0.3.2
Metric: “cosine”, init “random”
Input: (16_777_214, 1_077_871) binary matrix, 51_096_439 non-zero entries, scipy.sparse.csr format, dtype float64

Attempts to debug

I initially suspected spectral initialisation, but init “random” fails in the same way.
All values in the input array are finite (no NaN or inf).
I tried running the numba cosine metric exactly as implemented in UMAP on random pairs of vectors for several million iterations, but never got NaN or inf, as expected.
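
For reference, the two checks above can be sketched roughly as follows. This is a plain NumPy/SciPy stand-in, not UMAP's numba code: the function names are hypothetical, and the handling of zero-norm rows (distance 0 if both rows are empty, 1 if only one is) is an assumed convention.

```python
import numpy as np
import scipy.sparse


def sparse_all_finite(X):
    """Return True if every explicitly stored entry of X is finite."""
    return bool(np.isfinite(X.data).all())


def sparse_row_cosine(X, i, j):
    """Cosine distance between rows i and j of a CSR matrix.

    Zero-norm handling is an assumed convention, not taken from UMAP.
    """
    a, b = X[i], X[j]
    dot = a.multiply(b).sum()
    norm_a = np.sqrt(a.multiply(a).sum())
    norm_b = np.sqrt(b.multiply(b).sum())
    if norm_a == 0.0 and norm_b == 0.0:
        return 0.0
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def count_nonfinite_pairs(X, n_pairs=1000, seed=0):
    """Sample random row pairs and count distances that are NaN or inf."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    bad = 0
    for _ in range(n_pairs):
        i, j = rng.integers(0, n_rows, size=2)
        if not np.isfinite(sparse_row_cosine(X, int(i), int(j))):
            bad += 1
    return bad
```

On the real data this would be called as sparse_all_finite(X) and count_nonfinite_pairs(X, n_pairs=1_000_000); both came back clean for me, which is why the metric itself does not look like the culprit.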

Example code

This example uses 2^24 points, but the problem also occurs with 5M. The first 1M rows embed correctly:

     import numpy as np
     import scipy.sparse
     import umap
     X = scipy.sparse.load_npz("factorized_16777216.npz")
     embedding = umap.UMAP(metric='cosine', init='random', n_epochs=500, verbose=2).fit_transform(X)
     np.save('embedded_16777216_pts.npy', embedding.astype(np.float32))

The data file factorized_16777216.npz is here: https://drive.google.com/open?id=1SnpvkoqfX4-u-BS8KfWealz0VFHZ5goc [100MB]

The same problem can be reproduced with just the first 5M rows, which fits in about 40GB of RAM.

Is there any way to debug where and when this is happening? I suppose I can use np.seterr() to trap NaNs, but I’m not sure whether that will help with the numba-accelerated parts.
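
To make the question concrete, here is a minimal sketch of the two things I can do from plain NumPy: find which rows of a returned embedding contain NaN, and trap invalid floating-point operations with np.errstate. The helper names are hypothetical. The trap only demonstrably covers NumPy's own operations; my assumption is that numba-compiled code may not honour it.

```python
import numpy as np


def summarize_nan_rows(embedding):
    """Report how many rows of a 2-D embedding contain NaN, and the first one."""
    nan_mask = np.isnan(embedding).any(axis=1)
    bad_rows = np.flatnonzero(nan_mask)
    return {
        "n_bad": int(bad_rows.size),
        "first_bad": int(bad_rows[0]) if bad_rows.size else None,
        "all_bad": bool(nan_mask.all()),
    }


def numpy_traps_invalid():
    """Check that NumPy raises on 0/0 when invalid operations are set to 'raise'."""
    try:
        with np.errstate(invalid="raise"):
            np.array([0.0]) / np.array([0.0])
    except FloatingPointError:
        return True
    return False
```

If summarize_nan_rows(embedding)["all_bad"] is True, as it is in my case, the NaNs presumably spread during optimisation rather than coming from a few bad input rows, but that is a guess.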