
s3.read_csv slow with chunksize

Describe the bug

I’m not sure the s3.read_csv function really reads a csv in chunks. I noticed that for relatively big dataframes, running the following instruction takes an abnormally large amount of time:

it = wr.s3.read_csv(uri, chunksize=chunksize)

I think the chunksize parameter is ignored.

To Reproduce

I’m running awswrangler==1.1.2 (installed with poetry) but I quickly tested 1.6.3 and it seems the issue is there too.

from itertools import islice
from smart_open import open as sopen
import awswrangler as wr
import pandas as pd
from io import StringIO

uri = ""

CHUNKSIZE = 100

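def manual_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Stream the object and read only the first `chunksize` lines
    # (the header counts as one of them).
    with sopen(uri, "r") as f:
        chunk = "".join(islice(f, chunksize))
        df = pd.read_csv(StringIO(chunk))

    return df


def s3_chunking(uri: str, chunksize: int = CHUNKSIZE) -> pd.DataFrame:
    # Ask awswrangler for an iterator of DataFrames and take the first one.
    it = wr.s3.read_csv(uri, chunksize=chunksize)

    df = next(it)

    return df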

I compared two different ways to load the first 100 lines of a “big” (1.2 GB) dataframe from S3:

1. with the equivalent of open(file, "r") and then lazily parsing the lines as a CSV string
2. using s3.read_csv with chunksize=100
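
For reference, the lazy approach can be extended beyond the first chunk. The following is only a minimal sketch using the standard library, with an in-memory stream standing in for the S3 object; the helper name iter_csv_chunks is hypothetical and is not part of awswrangler or smart_open.

```python
import csv
from io import StringIO
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def iter_csv_chunks(lines: Iterable[str], chunksize: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `chunksize` parsed rows, reading `lines` lazily."""
    # DictReader takes the header from the first line and reuses it for
    # every later row, so each chunk has the same column names.
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


# Stand-in for the S3 object (an assumption for illustration):
# a header line plus 1,000 data rows held in memory.
data = StringIO("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))

chunks = iter_csv_chunks(data, chunksize=100)
first = next(chunks)  # parses only the header and the first 100 data rows
```

Because the generator pulls lines only on demand, fetching the first chunk does not require reading the rest of the stream. The same function should work with any iterable of text lines, such as a file handle opened in text mode.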

Results:

In [3]: %time manual_chunking(uri)
CPU times: user 173 ms, sys: 22.9 ms, total: 196 ms
Wall time: 581 ms

In [8]: %time s3_chunking(uri)
CPU times: user 8.73 s, sys: 7.82 s, total: 16.5 s
Wall time: 3min 59s

In [9]: %time wr.s3.read_csv(uri)
CPU times: user 27.3 s, sys: 9.48 s, total: 36.7 s
Wall time: 3min 38s

The timings are more or less reproducible. Comparing the last two timings, I suspect that the chunksize parameter is ignored: loading the first 100 lines of the file takes roughly as long as reading the full file.
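
One way to check this kind of suspicion without relying on wall-clock time is to count how much of the source is consumed before the first chunk comes back. The sketch below is purely illustrative: CountingLines, lazy_chunks and eager_chunks are hypothetical helpers, and eager_chunks only imitates the suspected behavior; it is not awswrangler's actual implementation.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


class CountingLines:
    """Iterator wrapper that records how many lines were pulled from the source."""

    def __init__(self, lines: Iterable[str]) -> None:
        self._it = iter(lines)
        self.consumed = 0

    def __iter__(self) -> "CountingLines":
        return self

    def __next__(self) -> str:
        line = next(self._it)
        self.consumed += 1
        return line


def lazy_chunks(lines: Iterable[str], chunksize: int) -> Iterator[List[str]]:
    """Pull only `chunksize` lines from the source per chunk."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk


def eager_chunks(lines: Iterable[str], chunksize: int) -> Iterator[List[str]]:
    """Read the whole source first, then slice it into chunks."""
    everything = list(lines)
    for start in range(0, len(everything), chunksize):
        yield everything[start:start + chunksize]


def lines_read_for_first_chunk(
    reader: Callable[[Iterable[str], int], Iterator[List[str]]],
    total_lines: int = 10_000,
    chunksize: int = 100,
) -> int:
    """Return how many source lines were consumed to produce the first chunk."""
    source = CountingLines(f"{i}\n" for i in range(total_lines))
    next(reader(source, chunksize))
    return source.consumed


lazy_cost = lines_read_for_first_chunk(lazy_chunks)
eager_cost = lines_read_for_first_chunk(eager_chunks)
```

A reader that honors the chunk size consumes about one chunk of input for the first result, while an eager one consumes everything. That would match the pattern in the timings above, where the first chunk costs about as much as the full read.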

Is this expected?

Issue Analytics

Created 3 years ago
Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction

igorborgest commented, Jul 30, 2020

Released in 1.7.0!

1 reaction

igorborgest commented, Jul 21, 2020

I read that it “will override the regular default arguments configured in the function signature.” though?

Yes, originally that was the only behavior, but I will now update it to mention that it can also set internal, non-exposed configurations. Does that make sense?