`overflow encountered in true_divide` error when using Aligned UMAP

Hi! I’m a relatively new UMAP user, using Aligned UMAP to visualize the results of K-means clustering on a corpus of text documents across time.

As each time window has a number of documents that can be found in the succeeding time window, I generate a dictionary of relations and also obtain the distances of the documents from one another using the following process:

def get_distance(similarity):
    slice_dist = 1 - similarity # similarity -> numpy array of TFIDF scores
    slice_dist[slice_dist <= 0] = 0
    return slice_dist


def get_relation(from_df, to_df):
    slice1_ids = from_df['ids'].reset_index().drop(['received'], axis=1)
    slice2_ids = to_df['ids'].reset_index().drop(['received'], axis=1)

    shared_ids = list(set(slice2_ids['id'].tolist()) & set(slice1_ids['id'].tolist())) 
    ind1 = slice1_ids[slice1_ids['id'].isin(shared_ids)]
    ind2 = slice2_ids[slice2_ids['id'].isin(shared_ids)]
    
    relation = {}
    index1 = list(ind1.index)
    index2 = list(ind2.index)

    for i, item in enumerate(index1):
        relation[item] = index2[i]
        
    return relation

relations = []

for j, mat in slices.items():
    %time mat['distance'] = get_distance(mat['similarity'])
    
    if j > sliceKeys[0]:
        prev_mat = slices[j-1]
        %time relations.append(get_relation(prev_mat, mat))

distances = [] # Each time slice's distance is added to an array so that I have an array of distances
for j, mat in slices.items():
    distances.append(mat['distance'])
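
Editor's sketch, not part of the original post: the loop in get_relation pairs the filtered rows of the two slices by position, which only lines up if the shared documents appear in the same order in both slices. A minimal alternative, assuming each slice's document ids are available as a plain list in row order (the names build_relation, ids_from and ids_to are hypothetical), matches rows by id instead:

```python
def build_relation(ids_from, ids_to):
    """Map row positions in one slice to row positions in the next, matched by id.

    ids_from / ids_to are hypothetical inputs: the document ids of each
    slice, listed in the same order as the rows of its distance matrix.
    """
    # Position of every id in the later slice.
    pos_in_to = {doc_id: j for j, doc_id in enumerate(ids_to)}
    # Keep only ids present in both slices, keyed by position in the earlier one.
    return {
        i: pos_in_to[doc_id]
        for i, doc_id in enumerate(ids_from)
        if doc_id in pos_in_to
    }


relation = build_relation(["a", "b", "c"], ["c", "x", "a"])
```

Here "a" sits at row 0 of the first slice and row 2 of the second, so the result maps 0 to 2 regardless of row order. If an id can occur more than once in a slice, this sketch keeps only the last position in ids_to.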

My Aligned UMAP settings are as follows:

%%time
aligned_mapper = umap.AlignedUMAP(n_neighbors=5,
    min_dist=0.05,).fit(distances, relations=relations)

My distances array looks like this:

Previously this approach gave me no issues. However, I’ve been testing out new results and have been getting the error below over and over.

/Users/bianchi_dy/opt/anaconda3/lib/python3.7/site-packages/umap/spectral.py:256: UserWarning: WARNING: spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.

Falling back to random initialisation!
  "WARNING: spectral initialisation failed! The eigenvector solver\n"
/Users/bianchi_dy/opt/anaconda3/lib/python3.7/site-packages/umap/umap_.py:905: RuntimeWarning: overflow encountered in true_divide
  result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]

and the following traceback, which tells me I’m dividing by zero somewhere I’m not supposed to be?

--------------------
LinAlgError                               Traceback (most recent call last)
<timed exec> in <module>

~/opt/anaconda3/lib/python3.7/site-packages/umap/aligned_umap.py in fit(self, X, y, **fit_params)
    357                     embeddings[-1],
    358                     next_embedding,
--> 359                     np.vstack([left_anchors, right_anchors]),
    360                 )
    361             )

~/opt/anaconda3/lib/python3.7/site-packages/numba/np/linalg.py in _check_finite_matrix()
    751         if not np.isfinite(v.item()):
    752             raise np.linalg.LinAlgError(
--> 753                 "Array must not contain infs or NaNs.")
    754 
    755 

LinAlgError: Array must not contain infs or NaNs.

Any ideas as to what might be causing this error or how to fix it? My suspicion is that it’s to do with distances but I’m not sure if I need to perform some sort of normalization or pre-processing aside from turning TFIDF similarity scores into distances. Unfortunately this error came up the night before a deadline I was intending to use Aligned UMAP for, so it’d be great if anyone could point me in the right direction to solving this even in a hacky way.
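
Editor's sketch, not part of the original post: one quick way to test the suspicion about the distances is to count non-finite entries in each slice before fitting. The helper name report_non_finite and the sample arrays are hypothetical; the only assumption is that each slice's distances can be converted to a NumPy float array.

```python
import numpy as np


def report_non_finite(distances):
    """Return {slice_index: count of NaN/inf entries} for slices that have any."""
    report = {}
    for k, dist in enumerate(distances):
        dist = np.asarray(dist, dtype=float)
        bad = int((~np.isfinite(dist)).sum())
        if bad:
            report[k] = bad
    return report


# Hypothetical data: the second slice contains one NaN and one inf.
clean = np.array([[0.0, 0.3], [0.3, 0.0]])
dirty = np.array([[0.0, np.nan], [np.inf, 0.0]])
report = report_non_finite([clean, dirty])
```

An empty report means the inputs are finite, in which case the inf or NaN in the traceback must be produced later in the pipeline rather than by the input distances.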

Top GitHub Comments

GregDemand commented, Jun 17, 2022

I’ve fixed this issue with pull request #875. Basically the problem was in umap_.py line 919:

result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]

where the guard part of the statement didn’t match the calculation. The easiest fix was casting n_samples from np.float32 to np.float64 to match the type of result.

result[n_samples > 0] = float(n_epochs) / np.float64(n_samples[n_samples > 0])

This could have alternatively been fixed by refining the guard part of the statement to something like:

result[n_samples/n_epochs > 0] = float(n_epochs) / n_samples[n_samples/n_epochs > 0]

but that solution looks worse.
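
Editor's sketch, not part of the original comment: the mismatch between the guard and the calculation can be reproduced with NumPy alone. The values below are hypothetical and this is not UMAP's actual code. A tiny positive float32 passes the `> 0` guard, but dividing by it overflows float32, whose largest finite value is about 3.4e38.

```python
import numpy as np

n_epochs = 200
# Hypothetical values: 1e-37 is positive, so it passes the guard.
n_samples = np.array([1e-37, 0.5, 0.0], dtype=np.float32)
mask = n_samples > 0

# The division is carried out in float32; 200 / 1e-37 = 2e39 overflows to inf.
with np.errstate(over="ignore"):
    quotient_f32 = float(n_epochs) / n_samples[mask]

# Casting the divisor to float64 first keeps the quotient finite.
quotient_f64 = float(n_epochs) / n_samples[mask].astype(np.float64)

print(quotient_f32.dtype, quotient_f32)
print(quotient_f64.dtype, quotient_f64)
```

The inf produced by the float32 division is the kind of value that later trips the "Array must not contain infs or NaNs" check, which is why casting to float64 before dividing avoids the error.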

lmcinnes commented, Nov 16, 2021

Thanks for the reproducer. I’ll try to look into this when I get a little time.