update function of UMAP does not work

I'm trying to build an incremental trainer for UMAP that updates on batches of data. I'm testing this with MNIST.

import numpy as np
import sklearn.datasets
import umap
import umap.utils as utils
import umap.aligned_umap
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

mnist = fetch_openml('mnist_784', version=1)
mnist.target = mnist.target.astype(int)

first, second = mnist.data[:50000], mnist.data[50000:]
print(first.shape, second.shape)

standard_embedding = umap.UMAP(random_state=42).fit(first)
standard_embedding.update(second)

On update I see:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/var/folders/3d/d0dl2ykn6c18qg7kg_j7tplm0000gn/T/ipykernel_98177/3602609767.py in <module>
----> 1 standard_embedding.update(second)

~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/umap/umap_.py in update(self, X)
   3129 
   3130         else:
-> 3131             self._knn_search_index.update(X)
   3132             self._raw_data = self._knn_search_index._raw_data
   3133             (

~/.pyenv/versions/3.9.6/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pynndescent/pynndescent_.py in update(self, X)
   1611         X = check_array(X, dtype=np.float32, accept_sparse="csr", order="C")
   1612 
-> 1613         original_order = np.argsort(self._vertex_order)
   1614 
   1615         if self._is_sparse:

AttributeError: 'NNDescent' object has no attribute '_vertex_order'

Is this expected behavior? Am I using UMAP improperly here? I see an example using aligned_umap, but I was hoping to use standard UMAP because I do not have relations.

Issue Analytics

  • State: open
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

2 reactions
ThomasNickerson commented, Oct 21, 2021

I actually ran into this problem yesterday and have a fix ready to go @lmcinnes, will open a PR. It’s only an issue for the n>4096 path in update.

0 reactions
preet2312 commented, Dec 16, 2022

Hey @lmcinnes & @vedrocks15, I am working on something similar and got the same divide-by-zero error.
My dataset has more than 4M rows and 384 dimensions. I want to reduce it to 50 dimensions, but my system with 32 GB of RAM cannot process all 4M rows at once, so I have to use batch processing. I am trying to fit UMAP on small chunks of data, but update does not seem to work for this.

First small chunk: xvs[:10000].shape => (10000, 384)

model1 = umap.UMAP(
    n_neighbors=30,
    min_dist=0.0,
    n_components=50,
    random_state=42,
).fit(xvs[:10000])

model1.embedding_.shape => (10000, 50)

model1.update(xvs[10000:20000]) gives the following error:

ZeroDivisionError                         Traceback (most recent call last)
Input In [13], in <cell line: 1>()
----> 1 model1.update(xvs[10000:20000])

File ~\anaconda3\envs\py39\lib\site-packages\umap\umap_.py:3348, in UMAP.update(self, X)
   3344 init = np.zeros(
   3345     (self._raw_data.shape[0], self.n_components), dtype=np.float32
   3346 )
   3347 init[:original_size] = self.embedding_
-> 3348 init_update(init, original_size, self._knn_indices)
   3350 if self.n_epochs is None:
   3351     n_epochs = 0

ZeroDivisionError: division by zero

But when I re-run the same update code, I get a different error:

ValueError                                Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 model1.update(xvs[10000:20000])

File ~\anaconda3\envs\py39\lib\site-packages\umap\umap_.py:3347, in UMAP.update(self, X)
   3329 self.graph_, self._sigmas, self._rhos = fuzzy_simplicial_set(
   3330     self._raw_data,
   3331     self.n_neighbors,
   (...)
   3341     self.verbose,
   3342 )
   3344 init = np.zeros(
   3345     (self._raw_data.shape[0], self.n_components), dtype=np.float32
   3346 )
-> 3347 init[:original_size] = self.embedding_
   3348 init_update(init, original_size, self._knn_indices)
   3350 if self.n_epochs is None:

ValueError: could not broadcast input array from shape (10000,50) into shape (20000,50)
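
One possible reading of the two tracebacks, offered as an assumption rather than a confirmed diagnosis: the first failed update call had already appended the new rows to the model's stored raw data before raising, so on the retry original_size was 20000 while embedding_ still had 10000 rows. The NumPy-only sketch below reproduces that shape mismatch with stand-in arrays. The variable names mirror the traceback, but nothing here calls UMAP itself.

```python
import numpy as np

n_components = 50

# Stand-in for model1.embedding_, which the failed first call never updated.
embedding = np.zeros((10000, n_components), dtype=np.float32)

# Assumption: the failed first call left 20000 rows in the stored raw data,
# so the retry reads original_size = 20000 and then appends 10000 more rows.
original_size = 20000
total_rows = original_size + 10000

init = np.zeros((total_rows, n_components), dtype=np.float32)

try:
    # Same assignment as line 3347 in the traceback, with stand-in arrays.
    init[:original_size] = embedding
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

If this reading is right, fitting a fresh model after a failed update would avoid reusing the partially modified state.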

I am not sure how to approach this problem, or whether there is a better way to do batch processing in UMAP. I just need to fit the data in chunks and get model.embedding_ at the end for my next steps.

Thank you!
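
For the batch-processing question above, one alternative that avoids update entirely is to fit UMAP once on a sample that fits in memory and then embed the remaining rows chunk by chunk with transform, which UMAP provides for new data. The sketch below is a hypothetical helper (embed_in_chunks and its parameters are not part of umap). It assumes the rows used for fitting are representative of the full dataset, and it accepts any object with scikit-learn style fit and transform methods.

```python
import numpy as np


def embed_in_chunks(reducer, data, fit_rows, chunk_rows):
    """Fit `reducer` on the first `fit_rows` rows, then transform all rows in chunks.

    `reducer` is any object with scikit-learn style fit/transform methods,
    for example umap.UMAP(n_components=50). Hypothetical helper, not umap API.
    """
    reducer.fit(data[:fit_rows])
    pieces = []
    for start in range(0, data.shape[0], chunk_rows):
        chunk = data[start:start + chunk_rows]
        # Only one chunk is handed to the reducer at a time.
        pieces.append(np.asarray(reducer.transform(chunk)))
    # Stack the per-chunk embeddings back into one array, in row order.
    return np.vstack(pieces)
```

With umap installed, a call might look like embed_in_chunks(umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=50, random_state=42), xvs, fit_rows=100000, chunk_rows=10000), where the row counts are illustrative. Unlike update, this never changes the fitted model, so rows in later chunks do not influence the layout. If the data is ordered, shuffling it before the call would make the fitted sample more representative.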
