[Datasets] [Bug] Piece Serialization with cloudpickle is very slow.


Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Others

What happened + What you expected to happen

I tried to read many parquet files (~188k) from GCS, but cloudpickle.dumps() takes a very long time (about 3h 30m) at this line:

https://github.com/ray-project/ray/blob/92599d9127e228fe8d0a2d94ca75754ec21c4ae4/python/ray/data/datasource/parquet_datasource.py#L131

  7%|███████▌                                                         | 12524/188370 [15:14<3:28:14, 14.07it/s]

Is there a good way to speed this up?
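As a sanity check on the numbers in the progress bar above, the following sketch recomputes the time estimate from the reported item counts and rate. It assumes the rate stays constant at 14.07 items/s. The helper names `estimate_seconds` and `as_hms` are made up for illustration and are not part of Ray or tqdm.

```python
# Figures copied from the tqdm progress bar in the issue.
done = 12_524
total = 188_370
rate_per_s = 14.07  # cloudpickle.dumps() calls per second


def estimate_seconds(items: int, rate: float) -> float:
    """Time needed to process `items` at a constant `rate` (items/second)."""
    return items / rate


def as_hms(seconds: float) -> str:
    """Format a duration in seconds as H:MM:SS (fractions truncated)."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


remaining_s = estimate_seconds(total - done, rate_per_s)
full_run_s = estimate_seconds(total, rate_per_s)
ms_per_piece = 1000 / rate_per_s

print("remaining:", as_hms(remaining_s))
print("full run: ", as_hms(full_run_s))
print(f"per piece: {ms_per_piece:.0f} ms")
```

At this rate each piece costs roughly 71 ms to serialize, which is what makes ~188k pieces add up to hours.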

Versions / Dependencies

  • Ubuntu (Google Cloud Engine)
  • Python 3.8
  • Ray 1.9.0

Reproduction script

python/ray/data/datasource/parquet_datasource.py

I modified parquet_datasource.py to add a progress bar around the serialization step: https://github.com/ray-project/ray/blob/92599d9127e228fe8d0a2d94ca75754ec21c4ae4/python/ray/data/datasource/parquet_datasource.py#L131

        from tqdm import tqdm
        serialized_pieces = [cloudpickle.dumps(p) for p in tqdm(pq_ds.pieces)]
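Beyond a progress bar, it may help to record how long each `dumps` call takes and how large each payload is, to see whether a few pieces dominate or every piece is uniformly slow. The sketch below is one possible way to do that. It uses the standard-library `pickle` as a stand-in so it runs anywhere, and `profile_dumps` and the sample `pieces` list are hypothetical, not part of Ray.

```python
import pickle
import time
from typing import Any, Callable, Iterable, List, Tuple


def profile_dumps(
    objs: Iterable[Any],
    dumps: Callable[[Any], bytes] = pickle.dumps,
) -> List[Tuple[float, int]]:
    """Return (seconds, payload_bytes) for serializing each object."""
    results = []
    for obj in objs:
        start = time.perf_counter()
        payload = dumps(obj)
        results.append((time.perf_counter() - start, len(payload)))
    return results


# Hypothetical stand-in data; in the issue this would be pq_ds.pieces.
pieces = [
    {"path": f"part-{i}.parquet", "row_groups": list(range(i))}
    for i in range(5)
]

stats = profile_dumps(pieces)
total_s = sum(seconds for seconds, _ in stats)
largest = max(size for _, size in stats)
print(f"{len(stats)} objects, {total_s:.6f} s total, largest payload {largest} bytes")
```

To apply this to the case in the issue, one would presumably pass `dumps=cloudpickle.dumps` and iterate over `pq_ds.pieces` instead of the sample list.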

main.py

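    import ray
    from gcsfs import GCSFileSystem  # assumption: the original snippet omitted its imports

    # Read ~188k parquet files from GCS, keeping only the `_id` column.
    data = ray.data.read_parquet(
        paths=['gs://<BUCKET>/parquets/'],
        filesystem=GCSFileSystem(),
        columns=['_id'],
        ray_remote_args={"num_cpus": 0.5},
        parallelism=1024)
    print(data)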

Anything else

I think this issue is related to #19089.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 9 (5 by maintainers)

Top GitHub Comments

eggie5 commented, Mar 7, 2022 (1 reaction)

This fixed it for me, thanks!

clarkzinzow commented, Mar 1, 2022 (1 reaction)

@eggie5 @mwbyeon We recently added a mitigation that resulted in a ~10x speedup in our many-file nightly test. Could you try the wheel for this specific commit?

https://docs.ray.io/en/master/ray-overview/installation.html#installing-from-a-specific-commit

The commit in question is cf3577f0ee8c09c4315def1593545487432fe3e2

On Linux with Python 3.7, this would be installed via:

    pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/cf3577f0ee8c09c4315def1593545487432fe3e2/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl

