Read_parquet is slower than expected with S3
I was looking at a read_parquet profile with @th3ed, @ncclementi, and @gjoseph92.
Looking at this performance report: https://raw.githubusercontent.com/coiled/h2o-benchmarks/main/performance-reports-pyarr_str-50GB/q1_50GB_pyarr.html I see the following analysis (two-minute video): https://www.loom.com/share/4c8ad1c5251a4e658c1c47ee2113f34a
We’re spending only about 20-25% of our time reading from S3, and about 5% of our time converting data to Pandas. We’re spending a lot of our time doing something else.
@gjoseph92 took a look at this with py-spy and generated reports like the following: tls-10_0_0_177-42425.json
I’m copying a note from him below:
What you’ll see from this is that pyarrow isn’t doing the actual reads. Because dask uses s3fs, the C++ arrow code has to call back into Python for each read. Ultimately, the reads are actually happening on the fsspec event loop (see the `fsspecIO` thread in the profiles). If we look there, about 40% of CPU time is spent waiting for something (aka data from S3, good), but 60% is spent doing stuff in Python (which I’d consider overhead, to some degree).
We can also see that 30% of the total time is spent blocking on Python’s GIL (all the `pthread_cond_timedwait` calls). Look at the functions calling into this and the corresponding lines in the Python source if you don’t believe me; they’re all `Py_END_ALLOW_THREADS`. This is an issue known as the convoy effect: https://bugs.python.org/issue7946, https://github.com/dask/distributed/issues/6325.
My takeaway is that using fsspec means dask is using Python for reads, which might be adding significant overhead / reducing parallelism due to the GIL.
I’d be interested in doing a comparison by hacking together a version that bypasses fsspec, and uses pyarrow’s native S3FileSystem directly. Before that though, it might be good to get some baseline numbers on how fast we can pull the raw data from S3 (just as bytes), to understand what performance we can expect.
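As a rough starting point, that comparison might look like the sketch below. The bucket path, object key, and region are placeholders (not from this issue), and it assumes S3 credentials are available in the environment; it times a raw-bytes pull and then a parquet read through pyarrow's native S3FileSystem, with no fsspec involved.

```python
import time

import pyarrow.fs as pafs
import pyarrow.parquet as pq

s3 = pafs.S3FileSystem(region="us-east-1")      # assumption: bucket region
path = "my-bucket/h2o-data/part.0.parquet"      # hypothetical object

# Baseline: how fast can Arrow's C++ S3 client pull the raw bytes?
t0 = time.perf_counter()
with s3.open_input_file(path) as f:
    raw = f.read()
print(f"raw bytes: {len(raw) / 1e6:.1f} MB in {time.perf_counter() - t0:.2f}s")

# Same object decoded as parquet, still bypassing fsspec/s3fs entirely.
t0 = time.perf_counter()
table = pq.read_table(path, filesystem=s3)
print(f"pq.read_table: {table.num_rows} rows in {time.perf_counter() - t0:.2f}s")
```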
FYI I also tried https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/, but it was ~2x slower. Haven’t tried repeating that though, so not sure if it’s a real result.
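For reference, repeating that test would look roughly like the sketch below; the path and column subset are made up, and the pattern simply follows `fsspec.parquet.open_parquet_file` as described in that post.

```python
import pandas as pd
from fsspec.parquet import open_parquet_file

path = "s3://my-bucket/h2o-data/part.0.parquet"   # hypothetical object
columns = ["id1", "v1"]                           # hypothetical column subset

# open_parquet_file pre-fetches only the byte ranges needed for the requested
# columns, then hands a file-like object to the parquet engine.
with open_parquet_file(path, columns=columns, engine="pyarrow") as f:
    df = pd.read_parquet(f, columns=columns, engine="pyarrow")
print(len(df))
```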
One other thing I find surprising is that polars appears to be using fsspec for reads as well, rather than the native S3FileSystem or GCSFileSystem: https://github.com/pola-rs/polars/blob/445c550e8f965d9e8f2da1cb2d01b6c15874f6c8/py-polars/polars/io.py#L949-L956 https://github.com/pola-rs/polars/blob/445c550e8f965d9e8f2da1cb2d01b6c15874f6c8/py-polars/polars/internals/io.py#L114-L121
I would have expected polars and dask read performance to be closer in this case. We should probably confirm for ourselves that they’re not.
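A quick way to check is to time both readers on the same file; this is only a sketch with a placeholder path, and it pulls everything onto one machine rather than reproducing the benchmark setup.

```python
import time

import dask.dataframe as dd
import polars as pl

path = "s3://my-bucket/h2o-data/part.0.parquet"   # hypothetical object

t0 = time.perf_counter()
pdf = dd.read_parquet(path).compute()
print(f"dask:   {time.perf_counter() - t0:.2f}s, {len(pdf)} rows")

t0 = time.perf_counter()
pldf = pl.read_parquet(path)   # polars also goes through fsspec for s3:// paths
print(f"polars: {time.perf_counter() - t0:.2f}s, {pldf.height} rows")
```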
It looks like we could make things a lot faster. I’m curious about the right steps to isolate the problem further.
cc’ing @martindurant @rjzamora @ritchie46 @fjetter
Top GitHub Comments
Even if uvloop solved the problem, I would still push for this change. Many people don’t use uvloop, and if we can give those people a 2x speedup for presumably no cost, then we should.
Note that you should already be able to do this by passing `open_file_options={"open_file_func": <pyarrow-file-open-func>}` to `dd.read_parquet`; see the example sketched below. Using `fs.open_input_file` does cut my wall time by ~50% for this simple example.
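A minimal sketch of that pattern (the bucket path and region are placeholders, not taken from the original comment):

```python
import dask.dataframe as dd
import pyarrow.fs as pafs

# Let pyarrow's native S3 client open the data files instead of s3fs.
fs = pafs.S3FileSystem(region="us-east-1")   # assumption: bucket region

df = dd.read_parquet(
    "s3://my-bucket/h2o-data/",              # hypothetical dataset path
    open_file_options={"open_file_func": fs.open_input_file},
)
print(df.head())
```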