
read_parquet is slower than expected with S3

See original GitHub issue

I was looking at a read_parquet profile with @th3ed, @ncclementi, and @gjoseph92.

Looking at this performance report: https://raw.githubusercontent.com/coiled/h2o-benchmarks/main/performance-reports-pyarr_str-50GB/q1_50GB_pyarr.html I see the following analysis (two-minute video): https://www.loom.com/share/4c8ad1c5251a4e658c1c47ee2113f34a

We’re spending only about 20-25% of our time reading from S3, and about 5% of our time converting data to Pandas. We’re spending a lot of our time doing something else.

@gjoseph92 took a look at this with py-spy and generated reports like the following: tls-10_0_0_177-42425.json

I’m copying a note from him below:

What you’ll see from this is that pyarrow isn’t doing the actual reads. Because dask uses s3fs, the C++ arrow code has to call back into Python for each read. Ultimately, the reads are actually happening on the fsspec event loop (see the fsspecIO thread in profiles). If we look there, about 40% of CPU time is spent waiting for something (aka data from S3, good), but 60% is spent doing stuff in Python (which I’d consider overhead, to some degree).

We can also see that 30% of the total time is spent blocking on Python’s GIL (all the pthread_cond_timedwaits) (look at the functions calling into this and the corresponding lines in the Python source if you don’t believe me; they’re all Py_END_ALLOW_THREADS). This is an issue known as the convoy effect: https://bugs.python.org/issue7946, https://github.com/dask/distributed/issues/6325.

My takeaway is that using fsspec means dask is using Python for reads, which might be adding significant overhead / reducing parallelism due to the GIL.
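
For what it’s worth, the wrapping that forces reads through Python is easy to see in isolation. This is a toy sketch of my own (not taken from the profiles above), using an in-memory buffer as a stand-in for an fsspec file object:

import io
import pyarrow as pa

# Toy stand-in for an fsspec file object; any Python file-like gets wrapped in
# pa.PythonFile before arrow's C++ readers can use it.
buf = io.BytesIO(b"parquet bytes would live here")
wrapped = pa.PythonFile(buf, mode="r")

# Every read() on the wrapper calls back into buf.read() and needs the GIL; a
# native file from pyarrow.fs.S3FileSystem.open_input_file stays entirely in C++.
print(wrapped.read(7))  # b'parquet'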

I’d be interested in doing a comparison by hacking together a version that bypasses fsspec, and uses pyarrow’s native S3FileSystem directly. Before that though, it might be good to get some baseline numbers on how fast we can pull the raw data from S3 (just as bytes), to understand what performance we can expect.
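
Something like the following would give that baseline. It’s an untested sketch using the public ursa-labs-taxi-data file (region us-east-2) as a stand-in for the 50 GB benchmark data, and it only measures raw byte throughput, ignoring parquet decoding entirely:

import time
import s3fs
import pyarrow.fs as pa_fs

path = "ursa-labs-taxi-data/2009/01/data.parquet"

def timed(label, pull):
    start = time.perf_counter()
    nbytes = len(pull())
    elapsed = time.perf_counter() - start
    print(f"{label}: {nbytes / elapsed / 2**20:.0f} MiB/s")

# Raw bytes through fsspec/s3fs -- the path dask takes today.
fsspec_fs = s3fs.S3FileSystem(anon=True)
timed("s3fs   ", lambda: fsspec_fs.cat(path))

# Raw bytes through pyarrow's C++ S3 client.
arrow_fs = pa_fs.S3FileSystem(anonymous=True, region="us-east-2")
timed("pyarrow", lambda: arrow_fs.open_input_file(path).read())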

FYI I also tried https://developer.nvidia.com/blog/optimizing-access-to-parquet-data-with-fsspec/, but it was ~2x slower. Haven’t tried repeating that though, so not sure if it’s a real result.
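
For reference, that post boils down to pre-fetching only the byte ranges the parquet footer points at, via fsspec.parquet. The exact invocation isn’t recorded here, but it looks roughly like this sketch (same public taxi file as above):

import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"

# open_parquet_file reads the footer first, then issues a few large ranged GETs
# covering exactly the requested row groups/columns, instead of many small reads
# driven by the parquet reader.
with open_parquet_file(
    path,
    engine="pyarrow",
    storage_options={"anon": True},
    columns=None,  # all columns; pass a list to prune
) as f:
    table = pq.read_table(f)

If I remember correctly, dd.read_parquet exposes the same machinery through open_file_options={"precache_options": {"method": "parquet"}}.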

One other thing I find surprising is that polars appears to be using fsspec for reads as well, rather than the native S3FileSystem or GCSFileSystem: https://github.com/pola-rs/polars/blob/445c550e8f965d9e8f2da1cb2d01b6c15874f6c8/py-polars/polars/io.py#L949-L956 https://github.com/pola-rs/polars/blob/445c550e8f965d9e8f2da1cb2d01b6c15874f6c8/py-polars/polars/internals/io.py#L114-L121

I would have expected polars and dask read performance to be closer in this case. We should probably confirm for ourselves that they’re not.
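
One quick way to check is to time polars on the same file through both filesystem layers. This is a sketch under my own assumptions (single file, the public taxi data again, region us-east-2) rather than the h2o benchmark setup:

import time
import polars as pl
import s3fs
import pyarrow.fs as pa_fs

path = "ursa-labs-taxi-data/2009/01/data.parquet"

fsspec_fs = s3fs.S3FileSystem(anon=True)
arrow_fs = pa_fs.S3FileSystem(anonymous=True, region="us-east-2")  # bucket's region

# polars fed by an fsspec file object (what its io helpers do today).
start = time.perf_counter()
with fsspec_fs.open(path, "rb") as f:
    pl.read_parquet(f)
print(f"polars via fsspec:       {time.perf_counter() - start:.1f}s")

# polars fed by pyarrow's native S3 file.
start = time.perf_counter()
with arrow_fs.open_input_file(path) as f:
    pl.read_parquet(f, use_pyarrow=True)
print(f"polars via arrow-native: {time.perf_counter() - start:.1f}s")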

It looks like we could make things a lot faster. I’m curious about the right steps to isolate the problem further.

cc’ing @martindurant @rjzamora @ritchie46 @fjetter

Issue Analytics

  • State: open
  • Created: a year ago
  • Comments: 42 (39 by maintainers)

Top GitHub Comments

1 reaction
mrocklin commented, Nov 18, 2022

Even if uvloop solved the problem I would still push for this change. Many people don’t use uvloop and if we can give those people a 2x speedup for presumably no cost then we should.
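
For context, opting into uvloop with distributed looks roughly like the sketch below. This is my own example, not from the thread, and it assumes uvloop is installed and that the relevant config key is "distributed.admin.event-loop" (values: tornado, asyncio, uvloop):

import dask
from dask.distributed import Client, LocalCluster

# Set the event-loop implementation before any cluster/client is created.
dask.config.set({"distributed.admin.event-loop": "uvloop"})

cluster = LocalCluster(n_workers=2, threads_per_worker=2)
client = Client(cluster)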

1 reaction
rjzamora commented, Nov 3, 2022

I’d be interested in doing a comparison by hacking together a version that bypasses fsspec, and uses pyarrow’s native S3FileSystem directly. Before that though, it might be good to get some baseline numbers on how fast we can pull the raw data from S3 (just as bytes), to understand what performance we can expect.

Note that you should already be able to do this by passing open_file_options={"open_file_func": <pyarrow-file-open-func>} to dd.read_parquet. For example:

import dask.dataframe as dd
import pyarrow as pa
import pyarrow.fs as pa_fs

path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
fs = pa_fs.S3FileSystem(anonymous=True)

ddf = dd.read_parquet(
    path,
    engine="pyarrow",
    storage_options={"anon": True},
    open_file_options={
        "open_file_func": fs.open_input_file,
    },
)

ddf.partitions[0].compute()

Using fs.open_input_file does cut my wall time by ~50% for this simple example.
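
For anyone who wants to reproduce that comparison, a rough timing harness along these lines (mine, not from the comment; single partition, same public taxi file) should do:

import time
import dask.dataframe as dd
import pyarrow.fs as pa_fs

path = "s3://ursa-labs-taxi-data/2009/01/data.parquet"
fs = pa_fs.S3FileSystem(anonymous=True)

def time_read(**kwargs):
    start = time.perf_counter()
    ddf = dd.read_parquet(
        path,
        engine="pyarrow",
        storage_options={"anon": True},
        **kwargs,
    )
    ddf.partitions[0].compute()
    return time.perf_counter() - start

# Default: s3fs/fsspec does the reads.
t_fsspec = time_read()

# pyarrow's native S3 client does the reads.
t_arrow = time_read(open_file_options={"open_file_func": fs.open_input_file})

print(f"fsspec: {t_fsspec:.1f}s  arrow-native: {t_arrow:.1f}s")

The absolute numbers will depend on where the client runs relative to us-east-2, but the ratio is the interesting part.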
