Pickling fitted UMAP on sparse array with more than 4096 rows leads to an error
I found a couple of similar issues (#556, #547), but I believe none of them matches exactly what I describe here. Apologies if this has already been covered in one of the other issues.
Here is the minimal reproducible example:
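import pickle
import scipy.sparse
import umap

# 4097 rows: one more than the 4096-row threshold in the title
mat = scipy.sparse.random(4097, 100)
reducer = umap.UMAP()
reducer.fit_transform(mat)

# Pickling the fitted reducer is the step that fails
with open('test.pkl', 'wb') as f:
    pickle.dump(reducer, f)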
It raises:
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of type(CPUDispatcher(<function select_side at 0x111e088b0>)) with parameters (readonly array(float32, 2d, C), float32, readonly array(float32, 1d, C), array(int64, 1d, C))
Known signatures:
* (array(float32, 1d, C), float32, array(float32, 1d, C), array(int64, 1d, C)) -> bool
* (readonly array(float32, 1d, C), float32, readonly array(float32, 1d, C), array(int64, 1d, C)) -> bool
During: resolving callee type: type(CPUDispatcher(<function select_side at 0x111e088b0>))
During: typing of call at /Users/campea/notebooks_experimental/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py (1181)
File "env/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1181:
    def tree_search_closure(point, rng_state):
        <source elided>
            while tree_children[node, 0] > 0:
                side = select_side(
                ^
Note that replacing reducer.fit_transform(mat) with reducer.fit_transform(mat.todense()) makes it work, but that means giving up the sparse array structure, which is not ideal.
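When comparing the sparse and dense variants, it can help to test pickling in memory rather than writing test.pkl each time. The helper below is a minimal sketch using only the standard library; the name try_pickle is made up for this illustration and is not part of UMAP or pynndescent.

```python
import pickle


def try_pickle(obj):
    """Return (True, None) if obj pickles in memory, else (False, exception).

    Hypothetical helper for this thread; not part of umap or pynndescent.
    """
    try:
        pickle.dumps(obj)
        return True, None
    except Exception as exc:
        # Any error raised while building the pickle ends up here.
        return False, exc
```

In the reproduction above, calling try_pickle(reducer) after fit_transform would be expected to report the failure for the sparse input and success for the dense one.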
I reproduced the error in a minimal virtualenv (Python 3.8) with only UMAP and its dependencies:
joblib==1.0.1
llvmlite==0.36.0
numba==0.53.1
numpy==1.20.3
pynndescent==0.5.2
scikit-learn==0.24.2
scipy==1.6.3
threadpoolctl==2.1.0
umap-learn==0.5.1
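To compare another environment against this list, the installed versions can be read with importlib.metadata (available in the standard library from Python 3.8). This is a small sketch; the function name installed_versions and the PACKAGES list are chosen for this example only.

```python
from importlib import metadata

# Distribution names taken from the list above.
PACKAGES = [
    "umap-learn",
    "pynndescent",
    "numba",
    "llvmlite",
    "numpy",
    "scipy",
    "scikit-learn",
]


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions
```

Printing the result of installed_versions() gives a quick check of whether pynndescent and numba match the versions where the error was seen.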
Top GitHub Comments
Yes, I was aggregating together some smaller issues. I think it is about time for a release. Sorry for being slow.
OK, I found the bug: it comes from the pynndescent package. The __getstate__ function, which is called before pickling, does not differentiate between _init_search_function and _init_sparse_search_function.
Here is a temporary fix on pynndescent_.py:927
@lmcinnes
Edit: Looks like it’s already fixed but not released yet
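For readers unfamiliar with the pattern in the __getstate__ comment above, here is a toy illustration of a pickling hook that has to pick the right initializer for sparse versus dense data. It is not pynndescent's code and not the patch referred to above; the class ToyIndex and all of its methods are invented for this sketch.

```python
import pickle


class ToyIndex:
    """Hypothetical stand-in for an index with a cached search function."""

    def __init__(self, data, is_sparse):
        self.data = data
        self.is_sparse = is_sparse

    def _init_dense_search(self):
        # Lambdas stand in for compiled closures that cannot be pickled.
        self._search = lambda query: ("dense", query)

    def _init_sparse_search(self):
        self._search = lambda query: ("sparse", query)

    def _init_search(self):
        # The dispatch step: choose the initializer that matches the data.
        if self.is_sparse:
            self._init_sparse_search()
        else:
            self._init_dense_search()

    def __getstate__(self):
        # Drop the cached closure; it is rebuilt when the object is loaded.
        state = self.__dict__.copy()
        state.pop("_search", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._init_search()

    def query(self, query):
        if not hasattr(self, "_search"):
            self._init_search()
        return self._search(query)
```

If the hook always used the dense initializer, a sparse index would come back wired to a search function of the wrong kind, which is the sort of mismatch the comment describes.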