[Bug][Datasets] TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Others
What happened + What you expected to happen
A Discuss user reported this issue: https://discuss.ray.io/t/multiple-designatedblockowner-processes/4819/6
Versions / Dependencies
Nightly wheels (839bc5019f61cabb122f9b341721a9bd04680238 at time of writing)
The user reported that it works fine on Ray 1.10 and 1.9.2.
Reproduction script
https://gist.github.com/mmuru/0d194ce09678e1ddd8515078276e12ac
Still happens if ray[data] is used instead of manually installing pyarrow + pandas.
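A minimal sketch of the failure mode, distilled from the serialized task arguments in the traceback below; the schema and partition column are placeholder assumptions, not the gist's actual values:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import ray

# Placeholder schema/partition column; the real repro is in the gist above.
partitioning = ds.HivePartitioning(pa.schema([("year", pa.int32())]))

ray.init()
# dataset_kwargs is forwarded to the remote _prepare_read task, so Ray must
# pickle the HivePartitioning object -- this is where the TypeError fires.
ray_dataset = ray.data.read_parquet(
    "ray_test_ds_part",  # placeholder path to a hive-partitioned Parquet dir
    dataset_kwargs={"partitioning": partitioning},
)
```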
Full traceback:
Traceback (most recent call last):
File "python/ray/_raylet.pyx", line 415, in ray._raylet.prepare_args_internal
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 412, in serialize
return self._serialize_to_msgpack(value)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 391, in _serialize_to_msgpack
metadata, python_objects
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 352, in _serialize_to_pickle5
raise e
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 348, in _serialize_to_pickle5
value, protocol=5, buffer_callback=writer.buffer_callback
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
return Pickler.dump(self, obj)
File "stringsource", line 2, in pyarrow._dataset.HivePartitioning.__reduce_cython__
TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 49, in <module>
ray_dataset_part2 = ray.data.read_parquet(table_location_part, **arrow_parquet_args)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/data/read_api.py", line 372, in read_parquet
**arrow_parquet_args,
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/data/read_api.py", line 221, in read_datasource
datasource, ctx, parallelism, _wrap_s3_filesystem_workaround(read_args)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 166, in _remote_proxy
return self._remote(args=args, kwargs=kwargs)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 303, in _invocation_remote_span
return method(self, args, kwargs, *_args, **_kwargs)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 462, in _remote
return invocation(args, kwargs)
File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 449, in invocation
runtime_env or "{}",
File "python/ray/_raylet.pyx", line 1513, in ray._raylet.CoreWorker.submit_task
File "python/ray/_raylet.pyx", line 1517, in ray._raylet.CoreWorker.submit_task
File "python/ray/_raylet.pyx", line 381, in ray._raylet.prepare_args_and_increment_put_refs
File "python/ray/_raylet.pyx", line 372, in ray._raylet.prepare_args_and_increment_put_refs
File "python/ray/_raylet.pyx", line 423, in ray._raylet.prepare_args_internal
TypeError: Could not serialize the argument {'paths': 'ray_test_ds_part', 'filesystem': None, 'columns': None, 'dataset_kwargs': {'partitioning': <pyarrow._dataset.HivePartitioning object at 0x7fee694645f0>}} for a task or actor ray.data.read_api._prepare_read. Check https://docs.ray.io/en/master/serialization.html#troubleshooting for more information.
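The root cause can be isolated from Ray entirely: cloudpickle falls back to the Cython-generated __reduce_cython__ stub, which raises. A minimal sketch, assuming a pyarrow version from the era of this report (newer pyarrow releases have since added pickling support for partitioning objects):

```python
import pickle

import pyarrow as pa
import pyarrow.dataset as ds

part = ds.HivePartitioning(pa.schema([("year", pa.int32())]))

# On the affected pyarrow versions this raises:
#   TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot
#   be converted to a Python object for pickling
pickle.dumps(part)
```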
Anything else
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Top GitHub Comments
That could work, although might be a bit messy. I think these are the mitigation options:
1. […]
2. Detect pa.HivePartitioning() and do metadata resolution locally if found.
3. Add a serialization workaround for pa.HivePartitioning() and friends, similar to what we did for pa.fs.S3FileSystem().

I'd lean towards (3) if it's possible.
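For illustration, a minimal sketch of what option (3) could look like via ray.util.register_serializer; the helper names are hypothetical, and reducing the partitioning to its schema alone is an assumption (a real workaround would also need to carry dictionaries and null_fallback state):

```python
import pyarrow as pa
import pyarrow.dataset as ds
import ray

def _reduce_hive_partitioning(part: ds.HivePartitioning) -> pa.Schema:
    # Assumption: the schema is enough to rebuild this partitioning.
    return part.schema

def _restore_hive_partitioning(schema: pa.Schema) -> ds.HivePartitioning:
    return ds.HivePartitioning(schema)

# Route HivePartitioning through plain-schema (de)serialization instead of
# cloudpickle's default reduce path, which is what raises the TypeError.
ray.util.register_serializer(
    ds.HivePartitioning,
    serializer=_reduce_hive_partitioning,
    deserializer=_restore_hive_partitioning,
)
```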
@clarkzinzow could you triage?