Pickling error when using UMAP with pynndescent in a Spark-based environment
When using UMAP (without pynndescent) in a Spark-based environment, it works fine.
When using UMAP with pynndescent, however, I get the following error:
_pickle.PicklingError: Failed in nopython mode pipeline (step: nopython mode backend)
Can't pickle <class 'collections.FlatTree'>: attribute lookup FlatTree on collections failed
Traceback (most recent call last):
File "/conda-env/lib/python3.6/site-packages/numba/core/pythonapi.py", line 1328, in serialize_object
gv = self.module.__serialized[obj]
KeyError: <class 'collections.FlatTree'>
I am using numba 0.50.1.
Issue Analytics
- Created: 3 years ago
- Reactions: 4
- Comments: 12 (4 by maintainers)
Top Results From Across the Web

UMAP PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'>
The problem seems to be with Numpy. I was running 1.20 when hitting this error. Downgrading with pip install numpy==1.19 resolves it.

can't pickle listreverseiterator objects using pyspark - splunktool
When using UMAP (without pynndescent) in a Spark-based environment, it works fine. A number of pickling issues were resolved in pynndescent v0.5 and umap…
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
I was running into this today, and I think I found the cause. It looks like pyspark is overwriting collections.namedtuple with its own function, but that function isn't correctly setting the __module__ field. Here's a minimal reproduction:
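The original reproduction snippet did not survive extraction here. Below is my own self-contained reconstruction of the failure mode described above, not the commenter's exact code: it simulates a namedtuple replacement (like pyspark's) whose defining module reports itself as `collections`. Because `namedtuple` infers the new class's `__module__` from its caller's frame, every class created through such a wrapper ends up claiming to live in `collections`, and pickling it by reference then fails.

```python
import collections
import pickle

real_namedtuple = collections.namedtuple

# Simulate the hijack: a replacement for collections.namedtuple that forwards
# to the real one from code whose global __name__ is 'collections'.
# namedtuple (with no module= argument) reads __name__ from its caller's
# frame globals, so classes created via this wrapper get
# __module__ == 'collections'.
src = "def hijacked(typename, field_names): return real_namedtuple(typename, field_names)"
hijack_globals = {"real_namedtuple": real_namedtuple, "__name__": "collections"}
exec(src, hijack_globals)
collections.namedtuple = hijack_globals["hijacked"]

FlatTree = collections.namedtuple("FlatTree", ["hyperplanes"])
print(FlatTree.__module__)  # 'collections' instead of the caller's module

try:
    pickle.dumps(FlatTree)
except pickle.PicklingError as exc:
    # Same failure mode as the issue: pickle tries to look up FlatTree
    # inside the collections module and cannot find it.
    print(exc)

collections.namedtuple = real_namedtuple  # undo the simulated hijack
```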
I think there are a couple of options for "fixing" this:

1. Fix it in pyspark: either set the __module__ correctly (as the stdlib does) or "un-hijack" collections.namedtuple after they're done with it.
2. Fix it in pynndescent: pass the module= kwarg to namedtuple as __name__. This avoids needing to make a change to pyspark but does require a change to pynndescent (seems like @lmcinnes is the maintainer of that one too, so it shouldn't be too hard 😄).

In the meantime, users can monkey-patch pynndescent.rp_trees.FlatTree after importing it to get around the bug. This is definitely the grossest solution, but it should work until one of the better solutions gets merged & released.
@lmcinnes if you think (2) is a reasonable course of action, I’m happy to open a PR.
Same here, but I'm not sure hyperopt is used in my project.
I wonder if it has something to do with pyspark's optimization of cPickle serialization in Python < 3.8 environments: https://github.com/apache/spark/blob/f84018a4810867afa84658fec76494aaae6d57fc/python/pyspark/serializers.py#L361
I am using pyspark 3.1.2, umap-learn 0.5.2, and Python 3.7.10.