Pickling fitted UMAP on sparse array with more than 4096 rows leads to an error

I found a couple of similar issues (#556, #547), but I believe none of them matches exactly what I describe here. Apologies if this has already been covered in one of the other issues.

Here is the minimal reproducible example:

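import pickle

import scipy.sparse
import umap

# 4097 rows: one more than the 4096 from the title
mat = scipy.sparse.random(4097, 100)

reducer = umap.UMAP()
reducer.fit_transform(mat)

# Pickling the fitted reducer raises the TypingError shown below
with open('test.pkl', 'wb') as f:
    pickle.dump(reducer, f)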

It raises:

numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Invalid use of type(CPUDispatcher(<function select_side at 0x111e088b0>)) with parameters (readonly array(float32, 2d, C), float32, readonly array(float32, 1d, C), array(int64, 1d, C))
Known signatures:
 * (array(float32, 1d, C), float32, array(float32, 1d, C), array(int64, 1d, C)) -> bool
 * (readonly array(float32, 1d, C), float32, readonly array(float32, 1d, C), array(int64, 1d, C)) -> bool
During: resolving callee type: type(CPUDispatcher(<function select_side at 0x111e088b0>))
During: typing of call at /Users/campea/notebooks_experimental/env/lib/python3.8/site-packages/pynndescent/pynndescent_.py (1181)


File "env/lib/python3.8/site-packages/pynndescent/pynndescent_.py", line 1181:
        def tree_search_closure(point, rng_state):
            <source elided>
            while tree_children[node, 0] > 0:
                side = select_side(
                ^

Note that if we replace reducer.fit_transform(mat) with reducer.fit_transform(mat.todense()), pickling works, but that means giving up the sparse array structure, which is not ideal.
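To put a rough number on why densifying is not ideal, here is a back-of-the-envelope sketch. It assumes the defaults of scipy.sparse.random (density 0.01, COO format, float64 values) and 32-bit row/column indices; the helper names are hypothetical, and the figures cover only the input matrix, not anything UMAP allocates internally.

```python
def dense_bytes(rows, cols, itemsize=8):
    """Bytes needed to store every cell of a dense float64 array."""
    return rows * cols * itemsize


def coo_bytes(nnz, itemsize=8, index_itemsize=4):
    """Bytes for a COO matrix: one value plus a row and a column index per non-zero."""
    return nnz * (itemsize + 2 * index_itemsize)


rows, cols = 4097, 100
nnz = rows * cols // 100  # assumed density of 0.01, i.e. 1% of the cells

print("dense:", dense_bytes(rows, cols), "bytes")
print("COO:  ", coo_bytes(nnz), "bytes")
```

Under these assumptions the dense copy is about 50 times larger than the sparse one, and the gap widens as the matrix gets sparser.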

I reproduced the error in a minimal virtualenv (Python 3.8) with only UMAP and its dependencies:

joblib==1.0.1
llvmlite==0.36.0
numba==0.53.1
numpy==1.20.3
pynndescent==0.5.2
scikit-learn==0.24.2
scipy==1.6.3
threadpoolctl==2.1.0
umap-learn==0.5.1
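To compare another environment against the versions listed above, a small sketch like the following may help. It relies only on the standard library's importlib.metadata (Python 3.8+); the function name and the default package list are choices made here for illustration.

```python
from importlib import metadata

# Packages most relevant to the traceback; extend as needed.
PACKAGES = ["umap-learn", "pynndescent", "numba", "llvmlite", "scipy", "numpy"]


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```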

lmcinnes commented, Jul 2, 2021

Yes, I was aggregating together some smaller issues. I think it is about time for a release. Sorry for being slow.

Osherz5 commented, Jun 27, 2021

OK, I found the bug. It comes from the pynndescent package: in the __getstate__ function, which is called before pickling, there is no differentiation between _init_search_function and _init_sparse_search_function, so the dense search function is always initialized, even for sparse data.

Here is a temporary fix at pynndescent_.py:927:

def __getstate__(self):
    if not hasattr(self, "_search_graph"):
        self._init_search_graph()
    if not hasattr(self, "_search_function"):
        if self._is_sparse:  # The fix: use the sparse initializer for sparse data
            self._init_sparse_search_function()
        else:
            self._init_search_function()

    result = self.__dict__.copy()
    if hasattr(self, "_rp_forest"):
        del result["_rp_forest"]
    result["_search_forest"] = tuple(
        [denumbaify_tree(tree) for tree in self._search_forest]
    )
    return result

@lmcinnes

Edit: It looks like this is already fixed, but the fix has not been released yet.
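The temporary fix above boils down to picking the initializer that matches the data layout before the object's state is copied for pickling. Here is a minimal, self-contained sketch of that pattern. ToyIndex and its string placeholders are hypothetical stand-ins, not pynndescent code; the real methods compile numba search functions rather than storing a label.

```python
import pickle


class ToyIndex:
    """Hypothetical index that builds its search function lazily."""

    def __init__(self, is_sparse):
        self._is_sparse = is_sparse

    def _init_search_function(self):
        # Placeholder for the dense search function.
        self._search_function = "dense"

    def _init_sparse_search_function(self):
        # Placeholder for the sparse search function.
        self._search_function = "sparse"

    def __getstate__(self):
        # Build the search function on demand, choosing the variant
        # that matches how the data is stored.
        if not hasattr(self, "_search_function"):
            if self._is_sparse:
                self._init_sparse_search_function()
            else:
                self._init_search_function()
        return self.__dict__.copy()


if __name__ == "__main__":
    # Round-trip through pickle; the restored object keeps the sparse variant.
    restored = pickle.loads(pickle.dumps(ToyIndex(is_sparse=True)))
    print(restored._search_function)
```

Without the `_is_sparse` branch, the dense initializer would run for sparse data as well, which is the situation the comment describes.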
