[Bug][Datasets] TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling

Search before asking

I searched the issues and found no similar issues.

Ray Component

Others

What happened + What you expected to happen

A user on the Ray Discuss forum reported this issue: https://discuss.ray.io/t/multiple-designatedblockowner-processes/4819/6

Versions / Dependencies

Nightly wheels (839bc5019f61cabb122f9b341721a9bd04680238 at time of writing)

The user reported that it works fine on Ray 1.10 and 1.9.2.

Reproduction script

https://gist.github.com/mmuru/0d194ce09678e1ddd8515078276e12ac

The error still happens if ray[data] is installed instead of manually installing pyarrow + pandas.

Full traceback:

Traceback (most recent call last):
  File "python/ray/_raylet.pyx", line 415, in ray._raylet.prepare_args_internal
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 412, in serialize
    return self._serialize_to_msgpack(value)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 391, in _serialize_to_msgpack
    metadata, python_objects
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 352, in _serialize_to_pickle5
    raise e
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/serialization.py", line 348, in _serialize_to_pickle5
    value, protocol=5, buffer_callback=writer.buffer_callback
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/cloudpickle/cloudpickle_fast.py", line 620, in dump
    return Pickler.dump(self, obj)
  File "stringsource", line 2, in pyarrow._dataset.HivePartitioning.__reduce_cython__
TypeError: self.hive_partitioning,self.partitioning,self.wrapped cannot be converted to a Python object for pickling

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 49, in <module>
    ray_dataset_part2 = ray.data.read_parquet(table_location_part, **arrow_parquet_args)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/data/read_api.py", line 372, in read_parquet
    **arrow_parquet_args,
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/data/read_api.py", line 221, in read_datasource
    datasource, ctx, parallelism, _wrap_s3_filesystem_workaround(read_args)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 166, in _remote_proxy
    return self._remote(args=args, kwargs=kwargs)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/util/tracing/tracing_helper.py", line 303, in _invocation_remote_span
    return method(self, args, kwargs, *_args, **_kwargs)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 462, in _remote
    return invocation(args, kwargs)
  File "/Users/cwong/anaconda3/envs/partitionrepro2/lib/python3.7/site-packages/ray/remote_function.py", line 449, in invocation
    runtime_env or "{}",
  File "python/ray/_raylet.pyx", line 1513, in ray._raylet.CoreWorker.submit_task
  File "python/ray/_raylet.pyx", line 1517, in ray._raylet.CoreWorker.submit_task
  File "python/ray/_raylet.pyx", line 381, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 372, in ray._raylet.prepare_args_and_increment_put_refs
  File "python/ray/_raylet.pyx", line 423, in ray._raylet.prepare_args_internal
TypeError: Could not serialize the argument {'paths': 'ray_test_ds_part', 'filesystem': None, 'columns': None, 'dataset_kwargs': {'partitioning': <pyarrow._dataset.HivePartitioning object at 0x7fee694645f0>}} for a task or actor ray.data.read_api._prepare_read. Check https://docs.ray.io/en/master/serialization.html#troubleshooting for more information.
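
The outer error reports the whole argument dict, while the inner error names only the pyarrow class. To confirm which entry is responsible, each value can be probed with pickle before the task is submitted. The following is a minimal sketch, not part of the original report: `UnpicklablePartitioning` is a hypothetical stand-in for the real HivePartitioning object, and `find_unpicklable` is an illustrative helper, not a Ray or pyarrow API.

```python
import pickle


class UnpicklablePartitioning:
    """Hypothetical stand-in for an object whose pickling hook raises."""

    def __reduce__(self):
        raise TypeError("cannot be converted to a Python object for pickling")


def find_unpicklable(value, path="args"):
    """Return the paths of all leaf values that fail to pickle."""
    if isinstance(value, dict):
        failures = []
        for key, item in value.items():
            failures.extend(find_unpicklable(item, f"{path}[{key!r}]"))
        return failures
    try:
        pickle.dumps(value)
    except Exception:
        # Any failure while pickling marks this leaf as the culprit.
        return [path]
    return []


# Same shape as the argument dict shown in the error message above.
read_args = {
    "paths": "ray_test_ds_part",
    "filesystem": None,
    "columns": None,
    "dataset_kwargs": {"partitioning": UnpicklablePartitioning()},
}
bad_paths = find_unpicklable(read_args)
print(bad_paths)
```

With the stand-in object, only the nested partitioning entry is reported; the plain strings and None values pickle without error.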

Anything else

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!


Top GitHub Comments

clarkzinzow commented, Feb 17, 2022

That could work, although it might be a bit messy. I think these are the mitigation options:

1. Submit the metadata resolution task, and retry locally if it fails with a certain type of error.
2. Explicitly check for unserializable arguments such as pa.HivePartitioning() and do metadata resolution locally if any are found.
3. Create a custom serialization wrapper for pa.HivePartitioning() and friends, similar to what we did for pa.fs.S3FileSystem().

I’d lean towards (3) if it’s possible.
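
The third option can be illustrated with a small sketch. It is an assumption-laden illustration rather than the fix that was actually applied: it uses the standard library's `copyreg` instead of Ray's own serialization hooks, and `FakeHivePartitioning`, `_rebuild_partitioning`, and `_reduce_partitioning` are hypothetical names standing in for the real pyarrow class and whatever reconstruction it would need.

```python
import copyreg
import pickle


class FakeHivePartitioning:
    """Hypothetical stand-in for a class that refuses to be pickled."""

    def __init__(self, field_names):
        self.field_names = list(field_names)

    def __reduce__(self):
        raise TypeError("cannot be converted to a Python object for pickling")


def _rebuild_partitioning(field_names):
    """Recreate the object on the receiving side from plain data."""
    return FakeHivePartitioning(field_names)


def _reduce_partitioning(obj):
    """Describe the object as (constructor, picklable args)."""
    return _rebuild_partitioning, (obj.field_names,)


# pickle consults the copyreg dispatch table before the class's own
# __reduce__, so the custom reducer takes over for this type.
copyreg.pickle(FakeHivePartitioning, _reduce_partitioning)

original = FakeHivePartitioning(["year", "month"])
restored = pickle.loads(pickle.dumps(original))
print(restored.field_names)
```

The design point is that only plain, picklable data (here a list of field names) crosses the process boundary, and the object is rebuilt from it on the other side. For the real class, the reducer would need to capture whatever state is sufficient to reconstruct it.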

ericl commented, Feb 17, 2022

@clarkzinzow could you triage?