Pickling fitted UMAP on sparse array with more than 4096 rows leads to an error


I found a couple of similar issues (#556, #547), but I believe none of them matches exactly what I describe here. Apologies if this has already been covered in one of the other issues.

Here is a minimal reproducible example:

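import pickle
import scipy.sparse  # import the submodule explicitly; a bare "import scipy" may not load it
import umap

# Sparse random matrix with 4097 rows, just over the 4096 rows named in the title
mat = scipy.sparse.random(4097, 100)

reducer = umap.UMAP()

reducer.fit_transform(mat)

# Pickling the fitted reducer is the step that fails
with open('test.pkl', 'wb') as f:
    pickle.dump(reducer, f)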

It raises:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of type(CPUDispatcher(<function select_side at 0x111e088b0>)) with parameters (readonly array(float32, 2d, C), float32, readonly array(float32, 1d, C), array(int64, 1d, C))
Known signatures:
 * (array(float32, 1d, C), float32, array(float32, 1d, C), array(int64, 1d, C)) -> bool
 * (readonly array(float32, 1d, C), float32, readonly array(float32, 1d, C), array(int64, 1d, C)) -> bool
During: resolving callee type: type(CPUDispatcher(<function select_side at 0x111e088b0>))
During: typing of call at /Users/campea/notebooks_experimental/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py (1181)


File "env/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1181:
        def tree_search_closure(point, rng_state):
            <source elided>
            while tree_children[node, 0] > 0:
                side = select_side(
                ^

Note that replacing reducer.fit_transform(mat) with reducer.fit_transform(mat.todense()) works, but that means giving up the sparse array structure, which is not ideal.

I reproduced the error in a minimal virtualenv (Python 3.8) with only UMAP and its dependencies:

joblib==1.0.1
llvmlite==0.36.0
numba==0.53.1
numpy==1.20.3
pynndescent==0.5.2
scikit-learn==0.24.2
scipy==1.6.3
threadpoolctl==2.1.0
umap-learn==0.5.1
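
Since the comments below suggest the fix lives in pynndescent and was not yet released at the time, it can help to confirm which versions are actually installed before retrying. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the package list are illustrative assumptions, not part of UMAP or pynndescent.

```python
from importlib import metadata

# Packages most relevant to this report (an illustrative selection)
PACKAGES = ["umap-learn", "pynndescent", "numba", "scipy"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        if version is None:
            print(f"{name} is not installed")
        else:
            print(f"{name}=={version}")
```

Comparing the printed pynndescent version against the 0.5.2 listed above shows whether the environment still matches the one in which the error was reproduced.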

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

3 reactions
lmcinnes commented, Jul 2, 2021

Yes, I was aggregating together some smaller issues. I think it is about time for a release. Sorry for being slow.

0 reactions
Osherz5 commented, Jun 27, 2021

OK, I found the bug: it comes from the pynndescent package. The __getstate__ function, which is called before pickling, does not differentiate between _init_search_function and _init_sparse_search_function.

Here is a temporary fix at pynndescent_.py:927:

def __getstate__(self):
    if not hasattr(self, "_search_graph"):
        self._init_search_graph()
    if not hasattr(self, "_search_function"):
        if self._is_sparse:  # The fix
            self._init_sparse_search_function()
        else:
            self._init_search_function()

    result = self.__dict__.copy()
    if hasattr(self, "_rp_forest"):
        del result["_rp_forest"]
    result["_search_forest"] = tuple(
        [denumbaify_tree(tree) for tree in self._search_forest]
    )
    return result

@lmcinnes

Edit: It looks like this is already fixed but not released yet.
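
To illustrate the pattern behind the temporary fix, here is a toy sketch of a `__getstate__` that picks its lazy initializer based on a sparse flag. The class `ToyIndex` and the helper `roundtrip` are hypothetical stand-ins written for this example; they only mimic the attribute names quoted in the comment and are not pynndescent's actual implementation.

```python
import pickle


class ToyIndex:
    """Toy stand-in for an index that builds its search function lazily."""

    def __init__(self, is_sparse):
        self._is_sparse = is_sparse

    def _init_search_function(self):
        # Dense path: a string marks which initializer ran.
        self._search_function = "dense"

    def _init_sparse_search_function(self):
        # Sparse path: the real library would build a sparse-aware search here.
        self._search_function = "sparse"

    def __getstate__(self):
        # Pick the initializer that matches the data the index was built on,
        # instead of always calling the dense one.
        if not hasattr(self, "_search_function"):
            if self._is_sparse:
                self._init_sparse_search_function()
            else:
                self._init_search_function()
        return self.__dict__.copy()


def roundtrip(index):
    """Pickle and unpickle an object in memory."""
    return pickle.loads(pickle.dumps(index))


if __name__ == "__main__":
    print(roundtrip(ToyIndex(is_sparse=True))._search_function)
    print(roundtrip(ToyIndex(is_sparse=False))._search_function)
```

Without the `_is_sparse` branch, the sparse instance would go through the dense initializer at pickling time, which is the kind of mismatch the comment above describes.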
