[Bug][Datasets] TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Others

What happened + What you expected to happen

A user on the Ray Discuss forum reported this issue: https://discuss.ray.io/t/multiple-designatedblockowner-processes/4819/6

Versions / Dependencies

Nightly wheels (839bc5019f61cabb122f9b341721a9bd04680238 at time of writing)

The user reported that it works fine on Ray 1.10 and 1.9.2.

Reproduction script

https://gist.github.com/mmuru/0d194ce09678e1ddd8515078276e12ac

The error still occurs if ray[data] is installed instead of manually installing pyarrow and pandas.

Full traceback:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.prepare_args_internal
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 412, in serialize
    return self._serialize_to_msgpack(value)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 391, in _serialize_to_msgpack
    metadata, python_objects
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 352, in _serialize_to_pickle5
    raise e
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 348, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
  File "stringsource", line 2, in pyarrow._dataset.HivePartitioning.__reduce_cython__
TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 49, in <module>
    ray_dataset_part2 = ray.data.read_parquet(table_location_part, **arrow_parquet_args)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/data/read_api.py", line 372, in read_parquet
    **arrow_parquet_args,
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/data/read_api.py", line 221, in read_datasource
    datasource, ctx, parallelism, _wrap_s3_filesystem_workaround(read_args)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 166, in _remote_proxy
    return self._remote(args=args, kwargs=kwargs)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 303, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 462, in _remote
    return invocation(args, kwargs)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 449, in invocation
    runtime_env or "{}",
  File "python/ray/_raylet.pyx", line 1513, in ray._raylet.CoreWorker.submit_task
  File "python/ray/_raylet.pyx", line 1517, in ray._raylet.CoreWorker.submit_task
  File "python/ray/_raylet.pyx", line 381, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 372, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 423, in ray._raylet.prepare_args_internal
TypeError: Could not serialize the argument {'paths': 'ray_test_ds_part', 'filesystem': None, 'columns': None, 'dataset_kwargs': {'partitioning': <pyarrow._dataset.HivePartitioning object at 0x7fee694645f0>}} for a task or actor ray.data.read_api._prepare_read. Check https://docs.ray.io/en/master/serialization.html#troubleshooting for more information.
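
The final error names the whole argument dict, but only one nested value (the partitioning object) actually fails to pickle. A minimal sketch of how one might narrow that down with the standard pickle module follows. FakePartitioning and find_unpicklable are hypothetical names made up for this illustration; FakePartitioning only imitates an object that refuses to pickle and is not the pyarrow class, and Ray itself serializes with cloudpickle rather than plain pickle.

```python
import pickle


class FakePartitioning:
    """Hypothetical stand-in for an object that refuses to pickle."""

    def __reduce__(self):
        raise TypeError("cannot be converted to a Python object for pickling")


def find_unpicklable(value, path="args"):
    """Return (path, error) pairs for the innermost values that fail to pickle."""
    try:
        pickle.dumps(value)
        return []
    except Exception as exc:
        if isinstance(value, dict):
            failures = []
            for key, item in value.items():
                failures.extend(find_unpicklable(item, f"{path}[{key!r}]"))
            if failures:
                return failures
        # Not a dict, or no child explains the failure: blame this value.
        return [(path, f"{type(exc).__name__}: {exc}")]


# Same shape as the argument dict in the traceback, with the stand-in object.
read_args = {
    "paths": "ray_test_ds_part",
    "filesystem": None,
    "columns": None,
    "dataset_kwargs": {"partitioning": FakePartitioning()},
}

failures = find_unpicklable(read_args, "read_args")
for failed_path, error in failures:
    print(failed_path, "->", error)
```

A check of this kind only walks nested dicts, which is enough for the read arguments shown here; lists or custom containers would need their own branches.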

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

clarkzinzow commented, Feb 17, 2022

That could work, although it might be a bit messy. I think these are the mitigation options:

  1. Submit metadata resolution task, retry locally if it fails with a certain type of error.
  2. Explicitly check for unserializable arguments such as pa.HivePartitioning() and do metadata resolution locally if found.
  3. Create custom serialization wrapper for pa.HivePartitioning() and friends, similar to what we did for pa.fs.S3FileSystem().

I’d lean towards (3) if it’s possible.
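
Option (3) can be sketched in general terms as a wrapper whose __reduce__ pickles only the arguments needed to rebuild the wrapped object, not the object itself. The example below is an illustration under stated assumptions, not Ray's actual fix: OpaquePartitioning, PicklablePartitioningWrapper, and _rebuild_wrapper are hypothetical names, and the stand-in class is rebuilt from a plain list of field names, which may not match what a real pyarrow partitioning needs.

```python
import pickle


class OpaquePartitioning:
    """Hypothetical stand-in for a type that cannot be pickled directly."""

    def __init__(self, field_names):
        self.field_names = list(field_names)

    def __reduce__(self):
        raise TypeError("cannot be converted to a Python object for pickling")


def _rebuild_wrapper(field_names):
    """Recreate the wrapped object on the receiving side from plain data."""
    return PicklablePartitioningWrapper(OpaquePartitioning(field_names))


class PicklablePartitioningWrapper:
    """Carries an unpicklable object across pickling via its constructor args."""

    def __init__(self, partitioning):
        self._partitioning = partitioning

    def unwrap(self):
        return self._partitioning

    def __reduce__(self):
        # Pickle only the data needed to rebuild, never the object itself.
        return _rebuild_wrapper, (self._partitioning.field_names,)


original = OpaquePartitioning(["year", "month"])

# Pickling the raw object fails, as in the traceback above.
try:
    pickle.dumps(original)
    direct_pickle_failed = False
except TypeError:
    direct_pickle_failed = True

# Pickling the wrapper succeeds and yields an equivalent object after loading.
restored = pickle.loads(pickle.dumps(PicklablePartitioningWrapper(original)))
print(direct_pickle_failed, restored.unwrap().field_names)
```

The caller would wrap the object before it is passed as a task argument and call unwrap() inside the task. The approach only works when the wrapped object can be fully described by picklable constructor arguments.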

ericl commented, Feb 17, 2022

@clarkzinzow could you triage?
