
Pickling error when using UMAP with pynndescent in a Spark-based environment

See original GitHub issue

When using UMAP (without pynndescent) in a Spark-based environment, it works fine.

However, when using UMAP with pynndescent, I get the following error:

_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed

Traceback (most recent call last):
  File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
    gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>

I'm using numba 0.50.1.
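For context, the error itself is pickle's generic failure mode for a class whose __module__ doesn't point at a module that actually exposes it: pickle stores classes by reference as (module, name) and has to be able to look the class back up, and numba's serialize_object goes through the same machinery. A minimal sketch with a hypothetical Broken namedtuple standing in for FlatTree (the exact message wording varies across Python versions; this matches 3.6/3.7):

>>> import collections
>>> import pickle
>>> Broken = collections.namedtuple("Broken", ["x"])
>>> Broken.__module__ = "collections"  # collections exposes no 'Broken'
>>> pickle.dumps(Broken)  # classes are pickled by reference (module + name)
Traceback (most recent call last):
  ...
_pickle.PicklingError: Can't pickle <class 'collections.Broken'>: attribute lookup Broken on collections failed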

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

30 reactions
kmurphy4 commented, Jun 16, 2021

I was running into this today, and I think I found the cause. It looks like pyspark is overwriting collections.namedtuple with its own function, but that function isn't correctly setting the __module__ field on the classes it creates.

Here’s a minimal reproduction:

>>> # don't load `pyspark`
>>> import collections
>>> collections.namedtuple("n", [])
<class '__main__.n'>
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [])
<class 'collections.n'>
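The mechanism behind the changed repr, as far as I can tell, is that when no module= argument is given, the stdlib namedtuple infers __module__ by reading __name__ out of the caller's frame globals. Any replacement whose code executes under another module's globals therefore stamps every new class with the wrong module. Here's a sketch that fakes the effect without pyspark (the _wrapper function and faked globals are illustrative, not pyspark's actual code):

>>> import collections, types
>>> _stdlib_namedtuple = collections.namedtuple
>>> def _wrapper(*args, **kwargs):
...     # delegate to the real namedtuple; __module__ is now inferred
...     # from *this* function's frame globals, not the user's module
...     return _stdlib_namedtuple(*args, **kwargs)
...
>>> # rebuild the wrapper with globals claiming to be 'collections',
>>> # approximating the effect of pyspark's in-place replacement
>>> g = dict(_wrapper.__globals__, __name__="collections")
>>> collections.namedtuple = types.FunctionType(_wrapper.__code__, g, "namedtuple")
>>> collections.namedtuple("n", [])
<class 'collections.n'>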

I think there are a couple of options for “fixing” this:

  1. Fix the issue in pyspark: either set __module__ correctly (as the stdlib does; see the sketch after this list) or “un-hijack” collections.namedtuple after they’re done with it.
  2. Pass module=__name__ to namedtuple. This avoids needing to make a change to pyspark but does require a change to pynndescent (it seems @lmcinnes is the maintainer of that one too, so it shouldn’t be too hard 😄 ). For example:
    >>> import pyspark
    >>> import collections
    >>> collections.namedtuple("n", [], module=__name__)
    <class '__main__.n'>
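For option 1, here's a sketch of what a module-preserving hijack could look like (hypothetical code, not a patch against pyspark's actual serializers.py): the wrapper forwards the caller's module explicitly instead of letting the inner call infer it from the wrong frame:

>>> import collections, sys
>>> _old_namedtuple = collections.namedtuple
>>> def namedtuple(*args, **kwargs):
...     if kwargs.get("module") is None:
...         # re-point module inference at *our* caller, as the stdlib does
...         kwargs["module"] = sys._getframe(1).f_globals.get("__name__", "__main__")
...     return _old_namedtuple(*args, **kwargs)
...
>>> collections.namedtuple = namedtuple
>>> collections.namedtuple("n", [])
<class '__main__.n'>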
    

In the meantime, users can monkey-patch pynndescent.rp_trees.FlatTree after importing it to get around the bug:

>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__  = "pynndescent.rp_trees"
>>> ... # my code goes after this

This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.
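If you go with the monkey-patch, one way to sanity-check it before kicking off Spark work is to round-trip the class through pickle yourself, since pickling a class by reference performs the same (module, name) lookup that numba's serializer failed on (a quick check, assuming pyspark and pynndescent are importable):

>>> import pickle
>>> import pyspark  # triggers the namedtuple hijack
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
>>> pickle.loads(pickle.dumps(pynndescent.rp_trees.FlatTree)) is pynndescent.rp_trees.FlatTree
True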

@lmcinnes if you think (2) is a reasonable course of action, I’m happy to open a PR.

0 reactions
Wh1isper commented, Apr 13, 2022

> Sorry to bring this one back, but I am having exactly the same issue with pyspark on colab. I am performing a hyperparameter optimization on UMAP and this happens.
>
> I use
>
> >>> import pyspark
> >>> import pynndescent
> >>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
>
> then in fmin from hyperopt I set the following argument:
>
> >>> trials=hyperopt.SparkTrials(n_trials)
>
> and the error follows. Please note that if n_trials is None or a negative number, it seems to work properly for some trials; otherwise it spits back this error.
>
> I'd also like to understand how I can choose to deactivate pynndescent from UMAP.
>
> Thank you in advance.
>
> BTW the error is: [Error Message: not captured in this mirror]
>
> edit 2: if I run umap as it is, it gives me no issues. I think this must be a problem between hyperopt, spark and umap.

Same, but I'm not sure hyperopt is used in my project.

I wonder whether it has something to do with pyspark's optimization of cPickle serialization in Python < 3.8 environments? https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361

I am using pyspark 3.1.2 and umap-learn 0.5.2 on Python 3.7.10.


Top Results From Across the Web

UMAP PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'>
The problem seems to be with Numpy. I was running 1.20 when hitting this error. Downgrading with pip install numpy==1.19 resolves it.

can't pickle listreverseiterator objects using pyspark - splunktool
"When using UMAP (without pynndescent) in a Spark-based environment, it works fine." A number of pickling issues were resolved in pynndescent v0.5 and umap…
