[Datasets] [Bug] Piece Serialization with cloudpickle is very slow.
Search before asking
- I searched the issues and found no similar issues.
Ray Component
Others
What happened + What you expected to happen
I tried to read many Parquet files (~188k) from GCS, but the cloudpickle.dumps() calls on the dataset pieces take a very long time (about 3h 30m in total):
7%|███████▌ | 12524/188370 [15:14<3:28:14, 14.07it/s]
Is there a good way to make this faster?
Versions / Dependencies
- Ubuntu (Google Cloud Engine)
- Python 3.8
- Ray 1.9.0
Reproduction script
python/ray/data/datasource/parquet_datasource.py
I modified parquet_datasource.py to show a progress bar around the piece serialization step:
https://github.com/ray-project/ray/blob/92599d9127e228fe8d0a2d94ca75754ec21c4ae4/python/ray/data/datasource/parquet_datasource.py#L131
from tqdm import tqdm
serialized_pieces = [cloudpickle.dumps(p) for p in tqdm(pq_ds.pieces)]
main.py
import ray
from gcsfs import GCSFileSystem  # assumed import; the original snippet does not show it

data = ray.data.read_parquet(
    paths=['gs://<BUCKET>/parquets/'],  # read many parquet files (~188k)
    filesystem=GCSFileSystem(),
    columns=['_id'],
    ray_remote_args={"num_cpus": 0.5},
    parallelism=1024,
)
print(data)
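The progress bar above reports about 14 pieces per second, i.e. roughly 71 ms per cloudpickle.dumps() call. For anyone who wants to measure the per-item cost without patching Ray, a minimal sketch follows. The helper name `time_serialization` and the stand-in sample objects are hypothetical and not part of Ray; the default serializer is the standard library's `pickle.dumps` so the sketch runs anywhere, and `cloudpickle.dumps` with a slice of `pq_ds.pieces` would be passed in to reproduce the measurement from this issue.

```python
import pickle
import time


def time_serialization(items, dumps=pickle.dumps):
    """Serialize each item with `dumps`.

    Returns (payloads, average seconds per item, total payload bytes).
    """
    payloads = []
    start = time.perf_counter()
    for item in items:
        payloads.append(dumps(item))
    elapsed = time.perf_counter() - start
    per_item = elapsed / len(payloads) if payloads else 0.0
    total_bytes = sum(len(p) for p in payloads)
    return payloads, per_item, total_bytes


# Stand-in objects (hypothetical). To reproduce the issue, pass a slice of
# pq_ds.pieces as `items` and cloudpickle.dumps as `dumps` instead.
sample = [{"path": f"part-{i}.parquet", "row_groups": [0]} for i in range(1000)]
payloads, per_item, total_bytes = time_serialization(sample)
print(f"{per_item * 1e6:.1f} us/item, {total_bytes} bytes total")
```

Looking at the total payload size as well as the time per item may help tell whether each piece is dragging a large shared object along with it, though that is only a guess at the cause here.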
Anything else
I think this issue is related to #19089.
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Comments:9 (5 by maintainers)
Top GitHub Comments
This fixed it for me, thanks!
@eggie5 @mwbyeon We recently added a mitigation that resulted in a ~10x speedup in our many-file nightly test. Could you try the wheel for this specific commit?
https://docs.ray.io/en/master/ray-overview/installation.html#installing-from-a-specific-commit
The commit in question is cf3577f0ee8c09c4315def1593545487432fe3e2
This would be installed on Linux for Python 3.7 via:
pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/cf3577f0ee8c09c4315def1593545487432fe3e2/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
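The issue was reported on Python 3.8, while the command above targets Python 3.7. As a rough sketch, the wheel URL for another interpreter can be assembled from the same pattern. The helper `ray_wheel_url` is hypothetical, and it is an assumption that a wheel was actually published for every Python version at this commit; the installation page linked above is the authoritative source for the URL format.

```python
def ray_wheel_url(commit, python_tag, abi_tag,
                  version="2.0.0.dev0", platform="manylinux2014_x86_64"):
    """Build a Ray master-branch wheel URL following the pattern of the link above."""
    base = "https://s3-us-west-2.amazonaws.com/ray-wheels/master"
    return f"{base}/{commit}/ray-{version}-{python_tag}-{abi_tag}-{platform}.whl"


COMMIT = "cf3577f0ee8c09c4315def1593545487432fe3e2"

# Python 3.7 wheels carry the "m" ABI suffix; Python 3.8 and later do not.
url37 = ray_wheel_url(COMMIT, "cp37", "cp37m")
url38 = ray_wheel_url(COMMIT, "cp38", "cp38")  # assumed to exist for this commit

print("pip install " + url38)
```

After installing, `ray.__commit__` can be compared against the commit hash to confirm that the expected build is in use.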