`overflow encountered in true_divide` error when using Aligned UMAP
Hi! I’m a relatively new UMAP user, using Aligned UMAP to visualize the results of K-means clustering on a corpus of text documents across time.
As each time window has a number of documents that can also be found in the succeeding time window, I generate a dictionary of relations and also obtain the distances of the documents from one another using the following process:
def get_distance(similarity):
    slice_dist = 1 - similarity  # similarity -> numpy array of TFIDF scores
    slice_dist[slice_dist <= 0] = 0
    return slice_dist

def get_relation(from_df, to_df):
    slice1_ids = from_df['ids'].reset_index().drop(['received'], axis=1)
    slice2_ids = to_df['ids'].reset_index().drop(['received'], axis=1)
    shared_ids = list(set(slice2_ids['id'].tolist()) & set(slice1_ids['id'].tolist()))
    ind1 = slice1_ids[slice1_ids['id'].isin(shared_ids)]
    ind2 = slice2_ids[slice2_ids['id'].isin(shared_ids)]
    relation = {}
    index1 = list(ind1.index)
    index2 = list(ind2.index)
    for i, item in enumerate(index1):
        relation[item] = index2[i]
    return relation
relations = []
for j, mat in slices.items():
    %time mat['distance'] = get_distance(mat['similarity'])
    if j > sliceKeys[0]:
        prev_mat = slices[j-1]
        %time relations.append(get_relation(prev_mat, mat))

distances = []  # Each time slice's distance is added to an array so that I have an array of distances
for j, mat in slices.items():
    distances.append(mat['distance'])
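One thing worth checking in `get_relation`: it pairs `index1[i]` with `index2[i]`, which only lines up correctly if the shared documents appear in the same order in both slices. A minimal sketch of matching by id instead is below. The helper name `build_relation` and the toy id lists are hypothetical, and it assumes ids are unique within each slice.

```python
def build_relation(from_ids, to_ids):
    """Map row positions in one slice to row positions in the next, matched by id.

    Hypothetical helper: assumes ids are unique within each slice.
    """
    # Row position of every id in the later slice.
    to_pos = {doc_id: pos for pos, doc_id in enumerate(to_ids)}
    # Keep only ids present in both slices, pairing positions by id.
    return {
        pos: to_pos[doc_id]
        for pos, doc_id in enumerate(from_ids)
        if doc_id in to_pos
    }


# Toy example: the shared documents appear in a different order in each slice.
slice1_ids = ["a", "b", "c", "d"]
slice2_ids = ["d", "x", "b", "y"]
relation = build_relation(slice1_ids, slice2_ids)
print(relation)  # {1: 2, 3: 0}
```

Pairing by position in this toy example would have mapped "b" to "d", which is the kind of mismatch the id lookup avoids.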
My Aligned UMAP settings are as follows:
%%time
aligned_mapper = umap.AlignedUMAP(
    n_neighbors=5,
    min_dist=0.05,
).fit(distances, relations=relations)
My distances array looks like this:
Previously this approach gave me no issues. However, I’ve been testing out new results and have been getting the error below over and over.
/Users/bianchi_dy/opt/anaconda3/lib/python3.7/site-packages/umap/spectral.py:256: UserWarning: WARNING: spectral initialisation failed! The eigenvector solver
failed. This is likely due to too small an eigengap. Consider
adding some noise or jitter to your data.
Falling back to random initialisation!
"WARNING: spectral initialisation failed! The eigenvector solver\n"
/Users/bianchi_dy/opt/anaconda3/lib/python3.7/site-packages/umap/umap_.py:905: RuntimeWarning: overflow encountered in true_divide
  result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]
and the following traceback, which tells me I’m dividing by zero somewhere I’m not supposed to be?
--------------------
LinAlgError                               Traceback (most recent call last)
<timed exec> in <module>

~/opt/anaconda3/lib/python3.7/site-packages/umap/aligned_umap.py in fit(self, X, y, **fit_params)
    357                     embeddings[-1],
    358                     next_embedding,
--> 359                     np.vstack([left_anchors, right_anchors]),
    360                 )
    361             )

~/opt/anaconda3/lib/python3.7/site-packages/numba/np/linalg.py in _check_finite_matrix()
    751         if not np.isfinite(v.item()):
    752             raise np.linalg.LinAlgError(
--> 753                 "Array must not contain infs or NaNs.")
    754
    755

LinAlgError: Array must not contain infs or NaNs.
Any ideas as to what might be causing this error or how to fix it? My suspicion is that it’s to do with the distances, but I’m not sure if I need to perform some sort of normalization or pre-processing aside from turning TFIDF similarity scores into distances. Unfortunately this error came up the night before a deadline I was intending to use Aligned UMAP for, so it’d be great if anyone could point me in the right direction to solving this, even in a hacky way.
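A quick way to test the suspicion about the distances is to check each matrix before calling `fit`. The sketch below is only an illustration: `check_distance_matrix` is a hypothetical helper, not part of UMAP, and the two small matrices are made-up data. A clean result does not rule out problems further down the pipeline.

```python
import numpy as np


def check_distance_matrix(dist, name="slice"):
    """Return a list of problems found in a square distance matrix.

    Hypothetical helper for illustration; not part of UMAP.
    """
    dist = np.asarray(dist, dtype=np.float64)
    problems = []
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        problems.append(f"{name}: not square, shape {dist.shape}")
        return problems
    if not np.isfinite(dist).all():
        problems.append(f"{name}: contains NaN or inf")
    if (dist < 0).any():
        problems.append(f"{name}: contains negative distances")
    if not np.allclose(dist, dist.T, equal_nan=True):
        problems.append(f"{name}: not symmetric")
    return problems


# Made-up examples: one clean matrix, one with NaNs.
good = np.array([[0.0, 0.3], [0.3, 0.0]])
bad = np.array([[0.0, np.nan], [np.nan, 0.0]])
print(check_distance_matrix(good))       # []
print(check_distance_matrix(bad, "t1"))  # ['t1: contains NaN or inf']
```

With the `distances` list from above, the same check could be run in a loop, for example `for i, d in enumerate(distances): print(check_distance_matrix(d, f"slice {i}"))`.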
Issue Analytics
- Created 2 years ago
- Comments:6 (5 by maintainers)
Top GitHub Comments
I’ve fixed this issue with pull request #875. Basically the problem was in umap_.py line 919:
    result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]
where the guard part of the statement didn’t match the calculation. The easiest fix was casting n_samples from np.float32 to np.float64 to match the type of result.
    result[n_samples > 0] = float(n_epochs) / np.float64(n_samples[n_samples > 0])
This could have alternatively been fixed by refining the guard part of the statement to something like:
    result[n_samples/n_epochs > 0] = float(n_epochs) / n_samples[n_samples/n_epochs > 0]
but that solution looks worse.
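The overflow described here can be reproduced in isolation. The following is a minimal sketch, not the UMAP source: the values in `n_samples` and the choice of `n_epochs = 200` are assumptions picked so that one `float32` entry is tiny but still passes the `> 0` guard. The cast is written with `astype`, which has the same effect as the `np.float64(...)` call in the fix.

```python
import numpy as np

n_epochs = 200
# Assumed values: the first float32 entry is tiny but still greater than zero.
n_samples = np.array([1e-40, 0.5, 0.0], dtype=np.float32)
mask = n_samples > 0  # the guard lets the tiny value through

with np.errstate(over="ignore"):
    # Division stays in float32: 200 / 1e-40 is about 2e42, which exceeds
    # the float32 maximum (about 3.4e38), so the result overflows to inf.
    unguarded = float(n_epochs) / n_samples[mask]

# Same division after casting to float64, which can represent 2e42.
cast = float(n_epochs) / n_samples[mask].astype(np.float64)

print(unguarded.dtype, np.isinf(unguarded))  # float32 [ True False]
print(cast.dtype, np.isfinite(cast))         # float64 [ True  True]
```

An `inf` produced this way would then be written into `result` and could propagate into later computations, which is consistent with the "Array must not contain infs or NaNs" error in the traceback.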
Thanks for the reproducer. I’ll try to look into this when I get a little time.