
read_parquet is slower than expected with S3

See original GitHub issue

I was looking at a read_parquet profile with @th3ed, @ncclementi, and @gjoseph92.

Looking at this performance report: https://raw.githubusercontent.com/coiled/h2o-benchmarks/main/performance-reports-pyarr_str-50GB/q1_50GB_pyarr.html I see the following analysis (two-minute video): https://www.loom.com/share/4c8ad1c5251a4e658c1c47ee2113f34a

We’re spending only about 20-25% of our time reading from S3, and about 5% of our time converting data to Pandas. We’re spending a lot of our time doing something else.

@gjoseph92 took a look at this with py-spy and generated reports like the following: tls-10_0_0_177-42425.json

I’m copying a note from him below:

What you’ll see from this is that pyarrow isn’t doing the actual reads. Because dask uses s3fs, the C++ arrow code has to call back into Python for each read. Ultimately, the reads are actually happening on the fsspec event loop (see the fsspecIO thread in profiles). If we look there, about 40% of CPU time is spent waiting for something (aka data from S3, good), but 60% is spent doing stuff in Python (which I’d consider overhead, to some degree).

We can also see that 30% of the total time is spent blocking on Python’s GIL (all the pthread_cond_timedwaits) (look at the functions calling into this and the corresponding lines in the Python source if you don’t believe me; they’re all Py_END_ALLOW_THREADS). This is an issue known as the convoy effect: https://bugs.python.org/issue7946, https://github.com/dask/distributed/issues/6325.

My takeaway is that using fsspec means dask is using Python for reads, which might be adding significant overhead / reducing parallelism due to the GIL.
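
For what it’s worth, the wrapping that forces reads through Python is easy to see in isolation. This is a toy sketch of my own (not taken from the profiles above), using an in-memory buffer as a stand-in for an fsspec file object:

import io
import pyarrow as pa

# Toy stand-in for an fsspec file object; any Python file-like gets wrapped in
# pa.PythonFile before arrow's C++ readers can use it.
buf = io.BytesIO(b"parquet bytes would live here")
wrapped = pa.PythonFile(buf, mode="r")

# Every read() on the wrapper calls back into buf.read() and needs the GIL; a
# native file from pyarrow.fs.S3FileSystem.open_input_file stays entirely in C++.
print(wrapped.read(7))  # b'parquet'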

I’d be interested in doing a comparison by hacking together a version that bypasses fsspec, and uses pyarrow’s native S3FileSystem directly. Before that though, it might be good to get some baseline numbers on how fast we can pull the raw data from S3 (just as bytes), to understand what performance we can expect.
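
Something like the following would give that baseline. It’s an untested sketch using the public ursa-labs-taxi-data file (region us-east-2) as a stand-in for the 50 GB benchmark data, and it only measures raw byte throughput, ignoring parquet decoding entirely:

import time
import s3fs
import pyarrow.fs as pa_fs

path = "ursa-labs-taxi-data/2009/01/data.parquet"

def timed(label, pull):
    start = time.perf_counter()
    nbytes = len(pull())
    elapsed = time.perf_counter() - start
    print(f"{label}: {nbytes / elapsed / 2**20:.0f} MiB/s")

# Raw bytes through fsspec/s3fs -- the path dask takes today.
fsspec_fs = s3fs.S3FileSystem(anon=True)
timed("s3fs   ", lambda: fsspec_fs.cat(path))

# Raw bytes through pyarrow's C++ S3 client.
arrow_fs = pa_fs.S3FileSystem(anonymous=True, region="us-east-2")
timed("pyarrow", lambda: arrow_fs.open_input_file(path).read())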

FYI I also tried https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/, but it was ~2x slower. Haven’t tried repeating that though, so not sure if it’s a real result.
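
For reference, that post boils down to pre-fetching only the byte ranges the parquet footer points at, via fsspec.parquet. The exact invocation isn’t recorded here, but it looks roughly like this sketch (same public taxi file as above):

import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"

# open_parquet_file reads the footer first, then issues a few large ranged GETs
# covering exactly the requested row groups/columns, instead of many small reads
# driven by the parquet reader.
with open_parquet_file(
    path,
    engine="pyarrow",
    storage_options={"anon": True},
    columns=None,  # all columns; pass a list to prune
) as f:
    table = pq.read_table(f)

If I remember correctly, dd.read_parquet exposes the same machinery through open_file_options={"precache_options": {"method": "parquet"}}.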

One other thing I find surprising is that polars appears to be using fsspec for reads as well, rather than the native S3FileSystem or GCSFileSystem: https://github.com/pola-rs/polars/blob/445c550e8f965d9e8f2da1cb2d01b6c15874f6c8/py-polars/polars/io.py#L949-L956 https://github.com/pola-rs/polars/blob/445c550e8f965d9e8f2da1cb2d01b6c15874f6c8/py-polars/polars/internals/io.py#L114-L121

I would have expected polars and dask read performance to be closer in this case. We should probably confirm for ourselves that they’re not.
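
One quick way to check is to time polars on the same file through both filesystem layers. This is a sketch under my own assumptions (single file, the public taxi data again, region us-east-2) rather than the h2o benchmark setup:

import time
import polars as pl
import s3fs
import pyarrow.fs as pa_fs

path = "ursa-labs-taxi-data/2009/01/data.parquet"

fsspec_fs = s3fs.S3FileSystem(anon=True)
arrow_fs = pa_fs.S3FileSystem(anonymous=True, region="us-east-2")  # bucket's region

# polars fed by an fsspec file object (what its io helpers do today).
start = time.perf_counter()
with fsspec_fs.open(path, "rb") as f:
    pl.read_parquet(f)
print(f"polars via fsspec:       {time.perf_counter() - start:.1f}s")

# polars fed by pyarrow's native S3 file.
start = time.perf_counter()
with arrow_fs.open_input_file(path) as f:
    pl.read_parquet(f, use_pyarrow=True)
print(f"polars via arrow-native: {time.perf_counter() - start:.1f}s")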

It looks like we could make things a lot faster. I’m curious about the right steps to isolate the problem further.

cc’ing @martindurant @rjzamora @ritchie46 @fjetter

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 42 (39 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Nov 18, 2022

Even if uvloop solved the problem I would still push for this change. Many people don’t use uvloop and if we can give those people a 2x speedup for presumably no cost then we should.
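
For context, opting into uvloop with distributed looks roughly like the sketch below. This is my own example, not from the thread, and it assumes uvloop is installed and that the relevant config key is "distributed.admin.event-loop" (values: tornado, asyncio, uvloop):

import dask
from dask.distributed import Client, LocalCluster

# Set the event-loop implementation before any cluster/client is created.
dask.config.set({"distributed.admin.event-loop": "uvloop"})

cluster = LocalCluster(n_workers=2, threads_per_worker=2)
client = Client(cluster)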

1 reaction
rjzamora commented, Nov 3, 2022

I’d be interested in doing a comparison by hacking together a version that bypasses fsspec, and uses pyarrow’s native S3FileSystem directly. Before that though, it might be good to get some baseline numbers on how fast we can pull the raw data from S3 (just as bytes), to understand what performance we can expect.

Note that you should already be able to do this by passing open_file_options={"open_file_func": <pyarrow-file-open-func>} to dd.read_parquet. For example:

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.fs as pa_fs

path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
fs = pa_fs.S3FileSystem(anonymous=True)

ddf = dd.read_parquet(
    path,
    engine="pyarrow",
    storage_options={"anon": True},
    open_file_options={
        "open_file_func": fs.open_input_file,
    },
)

ddf.partitions[0].compute()

Using fs.open_input_file does cut my wall time by ~50% for this simple example.
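
For anyone who wants to reproduce that comparison, a rough timing harness along these lines (mine, not from the comment; single partition, same public taxi file) should do:

import time
import dask.dataframe as dd
import pyarrow.fs as pa_fs

path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
fs = pa_fs.S3FileSystem(anonymous=True)

def time_read(**kwargs):
    start = time.perf_counter()
    ddf = dd.read_parquet(
        path,
        engine="pyarrow",
        storage_options={"anon": True},
        **kwargs,
    )
    ddf.partitions[0].compute()
    return time.perf_counter() - start

# Default: s3fs/fsspec does the reads.
t_fsspec = time_read()

# pyarrow's native S3 client does the reads.
t_arrow = time_read(open_file_options={"open_file_func": fs.open_input_file})

print(f"fsspec: {t_fsspec:.1f}s  arrow-native: {t_arrow:.1f}s")

The absolute numbers will depend on where the client runs relative to us-east-2, but the ratio is the interesting part.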
