
Pickling error when using UMAP with pynndescent in a Spark-based environment

See original GitHub issue

When using UMAP (without pynndescent) in a Spark-based environment, it works fine.

However, when using UMAP with pynndescent, I get the following error:

_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed

Traceback (most recent call last):
  File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
    gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>

I'm using numba 0.50.1.
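For context, the error itself is pickle's generic failure mode for a class whose __module__ doesn't point at a module that actually exposes it: pickle stores classes by reference as (module, name) and has to be able to look the class back up, and numba's serialize_object goes through the same machinery. A minimal sketch with a hypothetical Broken namedtuple standing in for FlatTree (the exact message wording varies across Python versions; this matches 3.6/3.7):

>>> import collections
>>> import pickle
>>> Broken = collections.namedtuple("Broken", ["x"])
>>> Broken.__module__ = "collections"  # collections exposes no 'Broken'
>>> pickle.dumps(Broken)  # classes are pickled by reference (module + name)
Traceback (most recent call last):
  ...
_pickle.PicklingError: Can't pickle <class 'collections.Broken'>: attribute lookup Broken on collections failed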

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Reactions: 4
  • Comments: 12 (4 by maintainers)

Top GitHub Comments

30 reactions
kmurphy4 commented, Jun 16, 2021

I was running into this today, and I think I found the cause. It looks like pyspark is overwriting collections.namedtuple with its own function, but that function isn't correctly setting the __module__ field on the classes it creates.

Here’s a minimal reproduction:

>>> # don't load `pyspark`
>>> import collections
>>> collections.namedtuple("n", [])
<class '__main__.n'>
>>> import pyspark
>>> import collections
>>> collections.namedtuple("n", [])
<class 'collections.n'>
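The mechanism behind the changed repr, as far as I can tell, is that when no module= argument is given, the stdlib namedtuple infers __module__ by reading __name__ out of the caller's frame globals. Any replacement whose code executes under another module's globals therefore stamps every new class with the wrong module. Here's a sketch that fakes the effect without pyspark (the _wrapper function and faked globals are illustrative, not pyspark's actual code):

>>> import collections, types
>>> _stdlib_namedtuple = collections.namedtuple
>>> def _wrapper(*args, **kwargs):
...     # delegate to the real namedtuple; __module__ is now inferred
...     # from *this* function's frame globals, not the user's module
...     return _stdlib_namedtuple(*args, **kwargs)
...
>>> # rebuild the wrapper with globals claiming to be 'collections',
>>> # approximating the effect of pyspark's in-place replacement
>>> g = dict(_wrapper.__globals__, __name__="collections")
>>> collections.namedtuple = types.FunctionType(_wrapper.__code__, g, "namedtuple")
>>> collections.namedtuple("n", [])
<class 'collections.n'>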

I think there are a couple of options for “fixing” this:

  1. Fix the issue in pyspark: either set __module__ correctly (as the stdlib does; see the sketch after this list) or “un-hijack” collections.namedtuple after they’re done with it.
  2. Pass module=__name__ to namedtuple. This avoids needing to make a change to pyspark but does require a change to pynndescent (it seems @lmcinnes is the maintainer of that one too, so it shouldn’t be too hard 😄 ). For example:
    >>> import pyspark
    >>> import collections
    >>> collections.namedtuple("n", [], module=__name__)
    <class '__main__.n'>
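For option 1, here's a sketch of what a module-preserving hijack could look like (hypothetical code, not a patch against pyspark's actual serializers.py): the wrapper forwards the caller's module explicitly instead of letting the inner call infer it from the wrong frame:

>>> import collections, sys
>>> _old_namedtuple = collections.namedtuple
>>> def namedtuple(*args, **kwargs):
...     if kwargs.get("module") is None:
...         # re-point module inference at *our* caller, as the stdlib does
...         kwargs["module"] = sys._getframe(1).f_globals.get("__name__", "__main__")
...     return _old_namedtuple(*args, **kwargs)
...
>>> collections.namedtuple = namedtuple
>>> collections.namedtuple("n", [])
<class '__main__.n'>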
    

In the meantime, users can monkey-patch pynndescent.rp_trees.FlatTree after importing it to get around the bug:

>>> import pyspark
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__  = "pynndescent.rp_trees"
>>> ... # my code goes after this

This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.
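If you go with the monkey-patch, one way to sanity-check it before kicking off Spark work is to round-trip the class through pickle yourself, since pickling a class by reference performs the same (module, name) lookup that numba's serializer failed on (a quick check, assuming pyspark and pynndescent are importable):

>>> import pickle
>>> import pyspark  # triggers the namedtuple hijack
>>> import pynndescent
>>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
>>> pickle.loads(pickle.dumps(pynndescent.rp_trees.FlatTree)) is pynndescent.rp_trees.FlatTree
True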

@lmcinnes if you think (2) is a reasonable course of action, I’m happy to open a PR.

0 reactions
Wh1isper commented, Apr 13, 2022

> Sorry to bring this one back, but I am having exactly the same issue with pyspark on colab. I am performing a hyperparameter optimization on UMAP and this happens.
>
> I use
>
> >>> import pyspark
> >>> import pynndescent
> >>> pynndescent.rp_trees.FlatTree.__module__ = "pynndescent.rp_trees"
>
> then in fmin from hyperopt I set the following argument:
>
> >>> trials=hyperopt.SparkTrials(n_trials)
>
> and the error follows. Please note that if n_trials is None or a negative number, it seems to work properly for some trials; otherwise it spits back this error.
>
> I'd also like to understand how I can choose to deactivate pynndescent from UMAP.
>
> Thank you in advance.
>
> BTW the error is: [Error Message: not captured in this mirror]
>
> edit 2: if I run umap as it is, it gives me no issues. I think this must be a problem between hyperopt, spark and umap.

Same, but I'm not sure hyperopt is used in my project.

I wonder whether it has something to do with pyspark's optimization of cPickle serialization in Python < 3.8 environments? https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361

I am using pyspark 3.1.2 and umap-learn 0.5.2 on Python 3.7.10.


Top Results From Across the Web

UMAP PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'>
The problem seems to be with Numpy. I was running 1.20 when hitting this error. Downgrading with pip install numpy==1.19 resolves it.

can't pickle listreverseiterator objects using pyspark - splunktool
"When using UMAP (without pynndescent) in a Spark-based environment, it works fine." A number of pickling issues were resolved in pynndescent v0.5 and umap…
